<a href="https://colab.research.google.com/github/taycurran/DS-Unit-2-Linear-Models/blob/master/CURRAN_M4_regression_classification_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [X] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [X] Begin with baselines for classification.
- [X] Use scikit-learn for logistic regression.
- [X] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [X] Get your model's test accuracy. (One time, at the end.)
- [X] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s) !
- [ ] Make exploratory visualizations.
- [X] Do one-hot encoding.
- [X] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

## MY CODE STARTS HERE

In [151]:
# Convert Date Column to DateTime DataType
df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)
df['Date'].head(2)

0   2016-01-18
1   2016-01-24
Name: Date, dtype: datetime64[ns]

In [152]:
# Establish Separation Conditions
cond_train = df['Date'] < '2017-01-01'
cond_validate = (df['Date'] >= '2017-01-01') & (df['Date'] < '2018-01-01')
cond_test = df['Date'] >= '2018-01-01'

# Execute Separation
train = df[cond_train]
validate = df[cond_validate]
test = df[cond_test]

print('Separated DF Shapes')
print(f"Train: {train.shape}")
print(f"Validate: {validate.shape}")
print(f"Test: {test.shape}")

Separated DF Shapes
Train: (298, 59)
Validate: (85, 59)
Test: (38, 59)


## **Test Split**

In [153]:
# Establish Separation Conditions
cond_test = df['Date'] >= '2018-01-01'

# Execute Separation
test = df[cond_test]

print('Separated Test Shape')
print(f"Test: {test.shape}")

Separated Test Shape
Test: (38, 59)


In [0]:
# Change the DataFrame to EXCLUDE Test Data
condN_test = ~cond_test
df = df[condN_test]

### **Wrangling**
#### I am performing data exploration on Train+Validation sets together excluding Test Data

Drop Sparse Columns (Categorical Encoding Happens After the Split)

In [0]:
# Drop Columns with Little or No Data
droppedCols = ['Mass (g)', 'Density (g/mL)', 'Queso', 'Ham', 'Lobster',
               'Mushroom', 'Bacon', 'Sushi', 'Avocado', 'Corn', 'Zucchini']

df = df.drop(droppedCols, axis=1)
test = test.drop(droppedCols, axis=1)

In [156]:
# Check to Ensure Changes to Columns are Consistent
df.columns == test.columns

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True])
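
The element-wise comparison above prints one boolean per column, which has to be scanned by eye. As a minimal sketch, `Index.equals` collapses the same check into a single answer. The two small frames below are hypothetical stand-ins, not the burrito data.

```python
import pandas as pd

# Hypothetical frames standing in for df and test
left = pd.DataFrame({"Cost": [6.5], "Great": [True]})
right = pd.DataFrame({"Cost": [7.0], "Great": [False]})
mismatched = right.rename(columns={"Cost": "Price"})

# Index.equals returns one boolean: True only if labels and order both match
same_columns = left.columns.equals(right.columns)
different_columns = left.columns.equals(mismatched.columns)
print(same_columns, different_columns)
```

In the notebook, the equivalent call would presumably be `df.columns.equals(test.columns)`.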

## **Train/Validate Split**

In [157]:
# Establish Separation Conditions
cond_train = df['Date'] < '2017-01-01'
cond_validate = (df['Date'] >= '2017-01-01') & (df['Date'] < '2018-01-01')

# Execute Separation
train = df[cond_train]
validate = df[cond_validate]

print('Separated DF Shapes')
print(f"Train: {train.shape}")
print(f"Validate: {validate.shape}")

Separated DF Shapes
Train: (298, 48)
Validate: (85, 48)


In [0]:
# Drop Date Column
train = train.drop(['Date'], axis=1)
validate = validate.drop(['Date'], axis=1)
test = test.drop(['Date'], axis=1)

In [0]:
# Establish Target and Feature Names
target = 'Great'
features = train.columns.drop(target)

# Establish X Matrices and y Vectors
X_train = train[features]
y_train = train[target]

X_val = validate[features]
y_val = validate[target]

X_test = test[features]
y_test = test[target]

In [0]:
# Useful Variables
trainSize = len(y_train)
valSize = len(y_val)
testSize = len(y_test)

### More Wrangling

In [0]:
import category_encoders as ce

# Execute OneHotEncoder
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train_encoded = encoder.fit_transform(X_train)
X_val_encoded = encoder.transform(X_val)
X_test_encoded = encoder.transform(X_test)

In [0]:
# Take Care of NaN Values
from sklearn.impute import SimpleImputer

# Execute SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train_encoded)
X_val_imputed = imputer.transform(X_val_encoded)
X_test_imputed = imputer.transform(X_test_encoded)

In [0]:
# Normalize Data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_val_scaled = scaler.transform(X_val_imputed)
X_test_scaled = scaler.transform(X_test_imputed)
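
The encode, impute, and scale steps above each repeat the same fit-on-train, transform-on-validate-and-test pattern. The unchecked "scikit-learn pipelines" stretch goal could be approached roughly as follows. This is only a sketch under stated assumptions: the toy frame, its column names, and its target rule are made up, and scikit-learn's own `OneHotEncoder` is swapped in for `category_encoders` so the example has no extra dependency.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the burrito data (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    "Burrito": rng.choice(["California", "Asada", "Other"], size=n),
    "Cost": rng.normal(7.0, 1.0, size=n),
    "Tortilla": rng.uniform(1.0, 5.0, size=n),
})
toy.loc[::10, "Cost"] = np.nan      # inject some missing values
toy_target = toy["Tortilla"] > 3.0  # made-up rule so there is a signal to learn

# One-hot encode the categorical column, mean-impute the numeric ones
preprocess = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Burrito"]),
        ("num", SimpleImputer(strategy="mean"), ["Cost", "Tortilla"]),
    ],
    sparse_threshold=0,  # always return a dense array so StandardScaler can center it
)

pipeline = make_pipeline(preprocess, StandardScaler(), LogisticRegression())

# Fit on the first 90 rows, score on the remaining 30
toy_train, toy_val = toy.iloc[:90], toy.iloc[90:]
pipeline.fit(toy_train, toy_target.iloc[:90])
toy_val_accuracy = pipeline.score(toy_val, toy_target.iloc[90:])
print("Toy validation accuracy:", toy_val_accuracy)
```

A single `fit` call trains every step on the training rows only, and `score` applies the already-fitted transformers to new rows, which removes the need for the intermediate `_encoded`, `_imputed`, and `_scaled` variables.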

## **Begin with the Baseline**

In [164]:
# Around 41 Percent of Burritos Are Great
# The mean of a boolean Series is the share of True values
percGreat = y_train.mean()
print(f"Length y_train: {trainSize}")
print(f"Percent of Great Burr: {percGreat}")
print("Majority = False")
print("\nValue Counts")
y_train.value_counts()

Length y_train: 298
Percent of Great Burr: 0.40939597315436244
Majority = False

Value Counts


False    176
True     122
Name: Great, dtype: int64

In [165]:
# Make Baseline Prediction from Train Mode (Majority)
basePredMode = pd.Series([False] * valSize)
print(f"Val Size: {valSize}")
print("basePredMode Value_Counts")
basePredMode.value_counts()

Val Size: 85
basePredMode Value_Counts


False    85
dtype: int64

In [166]:
# Make Baseline Prediction from Train Percentage
# Note: the True predictions land on the first rows, so this score depends on row order
GreatBase = int(round(percGreat*valSize)) * [True]
nGreatBase = int(round((1-percGreat)*valSize)) * [False]
basePredPerc = GreatBase + nGreatBase
basePredPerc = pd.Series(basePredPerc)
print("basePredPerc Value_Counts")
basePredPerc.value_counts()

basePredPerc Value_Counts


False    50
True     35
dtype: int64

In [167]:
from sklearn.metrics import accuracy_score

# Evaluate Baseline Accuracy from Percentage Prediction
basePercScore = accuracy_score(y_val, basePredPerc)

# Evaluate Baseline Accuracy from Mode (Majority) Prediction
baseModeScore = accuracy_score(y_val, basePredMode)

print("Baseline Accuracy Scores")
basePercScore, baseModeScore

Baseline Accuracy Scores


(0.5411764705882353, 0.5529411764705883)
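
The majority-class baseline above can also be expressed with scikit-learn's `DummyClassifier`, which learns the most frequent training label and predicts it for every row. The sketch below uses small hypothetical label arrays rather than the burrito data, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 6 of 10 training rows are False, so False is the majority class
toy_y_train = np.array([False] * 6 + [True] * 4)
toy_y_val = np.array([True, False, False, True, False])

# DummyClassifier ignores the features, so a placeholder column is enough
toy_X_train = np.zeros((len(toy_y_train), 1))
toy_X_val = np.zeros((len(toy_y_val), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_X_train, toy_y_train)
toy_baseline_accuracy = accuracy_score(toy_y_val, baseline.predict(toy_X_val))
print(toy_baseline_accuracy)  # 3 of 5 validation labels are False -> 0.6
```

Unlike the proportion-based baseline, this score does not change if the validation rows are reordered, which makes it the more stable reference point.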

## **Logistic Regression**

In [168]:
from sklearn.linear_model import LogisticRegression

# Instantiate the LogisticRegression Tool
model = LogisticRegression()

# Train the Model
model.fit(X_train_scaled, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
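
For the unchecked "get and plot your coefficients" stretch goal, one possible approach is to pair the fitted model's `coef_` array with the feature names in a pandas Series. The sketch below assumes made-up features and a made-up target, so the printed values say nothing about real burritos.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
toy_features = ["Tortilla", "Meat", "Cost"]  # hypothetical feature names
toy_X = pd.DataFrame(rng.normal(size=(200, 3)), columns=toy_features)

# Made-up target: pushed up by Meat, pushed down by Cost, unrelated to Tortilla
noise = rng.normal(scale=0.5, size=200)
toy_y = (2.0 * toy_X["Meat"] - 1.0 * toy_X["Cost"] + noise) > 0

toy_model = LogisticRegression()
toy_model.fit(StandardScaler().fit_transform(toy_X), toy_y)

# coef_ has shape (1, n_features) for a binary target, so take the first row
toy_coefficients = pd.Series(toy_model.coef_[0], index=toy_features).sort_values()
print(toy_coefficients)
```

Because the inputs are standardized, the coefficient magnitudes are roughly comparable across features. In the notebook, the index would presumably come from `X_train_encoded.columns`, and `toy_coefficients.plot.barh()` (which needs matplotlib) would draw the chart.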

## **Validation Accuracy**

In [169]:
print("Validation Accuracy", model.score(X_val_scaled, y_val))

Validation Accuracy 0.7647058823529411


## **Test Accuracy**

In [170]:
print("Test Accuracy", model.score(X_test_scaled, y_test))

Test Accuracy 0.7631578947368421


## **Conclusion**

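Validation accuracy (0.765) and test accuracy (0.763) are close in value.
This is consistent with low variance, although the test set has only 38 rows, so the estimate is noisy.
The model's validation accuracy is about 21 percentage points higher than the majority-class baseline (0.765 vs. 0.553), a relative improvement of roughly 38%, which indicates the model is considerably less biased than the baseline.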
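
The comparison with the baseline can be checked directly from the accuracies reported above. The sketch below also adds a rough binomial standard error for the test accuracy, which treats the 38 test reviews as independent draws; that is an assumption, not something the notebook establishes.

```python
import math

val_accuracy = 0.7647058823529411   # reported validation accuracy
test_accuracy = 0.7631578947368421  # reported test accuracy
mode_baseline = 0.5529411764705883  # reported majority-class baseline (validation)
n_test = 38                         # rows in the test split

# Absolute gain in accuracy, and the same gain relative to the baseline
point_gain = val_accuracy - mode_baseline
relative_gain = point_gain / mode_baseline

# Binomial standard error of an accuracy estimated from n_test rows
test_std_error = math.sqrt(test_accuracy * (1 - test_accuracy) / n_test)

print(f"Gain over baseline: {point_gain:.3f} ({relative_gain:.1%} relative)")
print(f"Test accuracy: {test_accuracy:.3f} +/- {1.96 * test_std_error:.3f}")
```

With so few test rows the interval is wide, so the close match between validation and test accuracy is encouraging but not conclusive.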