<a href="https://colab.research.google.com/github/urenajose/DS-Unit-2-Regression-1/blob/master/Jose_Urena_LS_DS_214_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.
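
The year-based split in the first checklist item can be sketched as follows. This is a minimal illustration on a toy frame, not the assignment solution: the `reviews` DataFrame and its four rows are made up, and only the `Date` and `Great` column names are taken from the notebook below.

```python
import pandas as pd

# Toy stand-in for the burrito reviews (hypothetical rows).
reviews = pd.DataFrame({
    "Date": ["5/1/2015", "1/18/2016", "3/2/2017", "8/27/2019"],
    "Great": [True, False, True, False],
})

# Parse the dates once, then split on the calendar year.
year = pd.to_datetime(reviews["Date"], format="%m/%d/%Y").dt.year
train = reviews[year <= 2016]      # 2016 & earlier
validate = reviews[year == 2017]   # 2017 only
test = reviews[year >= 2018]       # 2018 & later

print(len(train), len(validate), len(test))
```

Using `<=` and `>=` at the two ends means every review lands in exactly one split, including any dated before 2016.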


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).
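
For the pipeline stretch goal, one possible arrangement chains imputation, scaling, and logistic regression into a single estimator. The sketch below runs on synthetic data (an assumption, not the burrito ratings), and the choice of mean imputation and a plain `LogisticRegression` is illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: three numeric features and a binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan  # sprinkle in missing values

X_train, X_val = X[:150], X[150:]
y_train, y_val = y[:150], y[150:]

# Each step is fit on the training data only, then reused for validation.
pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(),
)
pipeline.fit(X_train, y_train)
val_accuracy = pipeline.score(X_val, y_val)
print(f"Validation accuracy: {val_accuracy:.3f}")
```

Because the pipeline owns the imputer and scaler, calling `pipeline.score` on new data applies the training statistics automatically, which makes it harder to transform the wrong split by accident.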

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [2]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [3]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [4]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [5]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [6]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [7]:
print(df.shape)

(421, 59)


In [8]:
display(df.head(),df.tail())

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,1/24/2016,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,1/27/2016,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
418,Other,8/27/2019,,,,6.0,1.0,,,17.0,20.5,0.57,5.0,4.0,3.5,,4.0,4.0,2.0,2.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
419,Other,8/27/2019,,,,6.0,4.0,,,19.0,26.0,1.02,4.0,5.0,,3.5,4.0,4.0,5.0,4.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True
420,California,8/27/2019,,,,7.9,3.0,,,20.0,22.0,0.77,4.0,4.0,4.0,3.7,3.0,2.0,3.5,4.0,4.5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
421,Other,8/27/2019,,,,7.9,3.0,,,22.5,24.5,1.07,5.0,2.0,5.0,5.0,5.0,2.0,5.0,5.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True
422,Other,8/27/2019,,,,5.5,3.5,,,17.0,21.3,0.61,3.0,5.0,4.3,4.0,4.9,3.8,3.0,4.5,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


### df info

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 421 entries, 0 to 422
Data columns (total 59 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Burrito         421 non-null    object 
 1   Date            421 non-null    object 
 2   Yelp            87 non-null     float64
 3   Google          87 non-null     float64
 4   Chips           26 non-null     object 
 5   Cost            414 non-null    float64
 6   Hunger          418 non-null    float64
 7   Mass (g)        22 non-null     float64
 8   Density (g/mL)  22 non-null     float64
 9   Length          283 non-null    float64
 10  Circum          281 non-null    float64
 11  Volume          281 non-null    float64
 12  Tortilla        421 non-null    float64
 13  Temp            401 non-null    float64
 14  Meat            407 non-null    float64
 15  Fillings        418 non-null    float64
 16  Meat:filling    412 non-null    float64
 17  Uniformity      419 non-null    flo

## Nulls

In [10]:
# Count the null values in each column
df_null = pd.DataFrame(df.isnull().sum())
df_null.head(3)

Unnamed: 0,0
Burrito,0
Date,0
Yelp,334


In [11]:
# Keep columns where fewer than 30% of the rows are null
df_low_nulls = df_null[df_null[0] < len(df) * 0.3].index.values
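
The same 30% filter can also be expressed as a fraction of missing values per column, which avoids hard-coding the row count. A small sketch on a toy frame (the values are hypothetical; only the column names come from the burrito data):

```python
import numpy as np
import pandas as pd

# Toy frame with one mostly-empty column.
toy = pd.DataFrame({
    "Cost": [6.49, 5.45, np.nan, 5.25],
    "Yelp": [3.5, np.nan, np.nan, np.nan],
    "Tortilla": [3.0, 2.0, 3.0, 3.0],
})

null_fraction = toy.isnull().mean()  # share of missing values per column
keep_cols = null_fraction[null_fraction < 0.3].index.tolist()
print(keep_cols)  # ['Cost', 'Tortilla']
```

Here `Cost` is 25% null and is kept, while `Yelp` is 75% null and is dropped.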

In [12]:
df_new = df[df_low_nulls]
df_new

Unnamed: 0,Burrito,Date,Cost,Hunger,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Great
0,California,1/18/2016,6.49,3.0,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,False
1,California,1/24/2016,5.45,3.5,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,False
2,Carnitas,1/24/2016,4.85,1.5,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,False
3,Asada,1/24/2016,5.25,2.0,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,False
4,California,1/27/2016,6.59,4.0,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
418,Other,8/27/2019,6.00,1.0,5.0,4.0,3.5,,4.0,4.0,2.0,2.0,5.0,False
419,Other,8/27/2019,6.00,4.0,4.0,5.0,,3.5,4.0,4.0,5.0,4.0,3.0,True
420,California,8/27/2019,7.90,3.0,4.0,4.0,4.0,3.7,3.0,2.0,3.5,4.0,4.5,False
421,Other,8/27/2019,7.90,3.0,5.0,2.0,5.0,5.0,5.0,2.0,5.0,5.0,2.0,True


In [13]:
df_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 421 entries, 0 to 422
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Burrito       421 non-null    object 
 1   Date          421 non-null    object 
 2   Cost          414 non-null    float64
 3   Hunger        418 non-null    float64
 4   Tortilla      421 non-null    float64
 5   Temp          401 non-null    float64
 6   Meat          407 non-null    float64
 7   Fillings      418 non-null    float64
 8   Meat:filling  412 non-null    float64
 9   Uniformity    419 non-null    float64
 10  Salsa         396 non-null    float64
 11  Synergy       419 non-null    float64
 12  Wrap          418 non-null    float64
 13  Great         421 non-null    bool   
dtypes: bool(1), float64(11), object(2)
memory usage: 46.5+ KB


In [14]:
df_new["Burrito"].value_counts()

California     169
Other          156
Asada           43
Surf & Turf     28
Carnitas        25
Name: Burrito, dtype: int64

In [15]:
# Encode the boolean target as 0/1; assign() returns a new frame,
# which avoids pandas' SettingWithCopyWarning on the column subset
df_new = df_new.assign(Great=df['Great'].astype(int))


In [16]:
df_new.head()

Unnamed: 0,Burrito,Date,Cost,Hunger,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Great
0,California,1/18/2016,6.49,3.0,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,0
1,California,1/24/2016,5.45,3.5,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,0
2,Carnitas,1/24/2016,4.85,1.5,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,0
3,Asada,1/24/2016,5.25,2.0,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,0
4,California,1/27/2016,6.59,4.0,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,1


In [17]:
import datetime as dt

In [18]:
# Parse the date strings; assign() avoids the SettingWithCopyWarning
df_new = df_new.assign(Date=pd.to_datetime(df_new['Date']))


In [19]:
# Extract the review year for the time-based split
df_new = df_new.assign(Year=df_new['Date'].dt.year)


In [20]:
# df_new1 = df_new.drop(["Date","Year"])

In [21]:
# Train on 2016 & earlier, validate on 2017, test on 2018 & later
df_train = df_new[df_new['Year'] <= 2016]
df_validate = df_new[df_new['Year'] == 2017]
df_test = df_new[df_new['Year'] >= 2018]

In [22]:
import category_encoders as ce
encoder = ce.OneHotEncoder(use_cat_names=True)



In [23]:
# Class balance of the target in each split
print(f'''Baseline for train Great \n{df_train["Great"].value_counts(normalize=True)} \n'''
f'''Baseline for validate Great \n{df_validate["Great"].value_counts(normalize=True)} \n'''
f'''Baseline for test Great \n{df_test["Great"].value_counts(normalize=True)}''')

Baseline for train Great 
0    0.591216
1    0.408784
Name: Great, dtype: float64 
Baseline for validate Great 
0    0.552941
1    0.447059
Name: Great, dtype: float64 
Baseline for test Great 
1    0.578947
0    0.421053
Name: Great, dtype: float64
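
The proportions above are each split's own class balance. A majority-class baseline instead always predicts the most frequent class from the training data and is scored on another split. Going by the printed proportions, predicting class 0 everywhere would score about 0.553 on validation and about 0.421 on test. A minimal sketch of that calculation, using short hypothetical label lists in place of `df_train["Great"]` and `df_validate["Great"]`:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical labels standing in for the real train and validation targets.
y_train_toy = pd.Series([0, 0, 0, 1, 1])
y_val_toy = pd.Series([0, 1, 1, 0])

majority_class = y_train_toy.mode()[0]        # most frequent class in training
y_pred = [majority_class] * len(y_val_toy)    # always predict that class
baseline_accuracy = accuracy_score(y_val_toy, y_pred)
print(majority_class, baseline_accuracy)  # 0 0.5
```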


In [24]:

x_train = encoder.fit_transform(df_train)

In [25]:
x_validate = encoder.transform(df_validate)

In [26]:
x_test = encoder.transform(df_test)

In [27]:
train_target = 'Great'
y_train = x_train[train_target]

In [28]:
x_train.head()

Unnamed: 0,Burrito_California,Burrito_Carnitas,Burrito_Asada,Burrito_Other,Burrito_Surf & Turf,Date,Cost,Hunger,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Great,Year
0,1,0,0,0,0,2016-01-18,6.49,3.0,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,0,2016
1,1,0,0,0,0,2016-01-24,5.45,3.5,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,0,2016
2,0,1,0,0,0,2016-01-24,4.85,1.5,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,0,2016
3,0,0,1,0,0,2016-01-24,5.25,2.0,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,0,2016
4,1,0,0,0,0,2016-01-27,6.59,4.0,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,1,2016


In [29]:
print(x_train['Date'].isnull().sum())
print(x_train['Year'].isnull().sum())

0
0


## Dropping Date, Great, and Year before imputing

In [35]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer()
x_train_imputed = imputer.fit_transform(x_train.drop(['Date','Great','Year'],axis=1))

In [36]:
x_validate_imputed = imputer.transform(x_validate.drop(['Date','Great','Year'],axis=1))

In [37]:
x_test_imputed = imputer.transform(x_test.drop(['Date','Great','Year'],axis=1))

In [38]:
from sklearn.preprocessing import StandardScaler
burrito_scaler = StandardScaler()

In [39]:
from sklearn.linear_model import LogisticRegressionCV
burrito_model = LogisticRegressionCV()

In [None]:
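# Fit the scaler on the imputed training features only
x_train_scaled = burrito_scaler.fit_transform(x_train_imputed)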
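
The model above is instantiated but never fit. A sketch of how the remaining steps might look, run here on synthetic data so that it is self-contained: the feature values and target are generated (an assumption), the three feature names are borrowed from the burrito columns purely for labeling, and `cv=5` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# Synthetic ratings on a 1-5 scale; the target depends on their average.
rng = np.random.default_rng(42)
feature_names = ["Tortilla", "Meat", "Synergy"]
X = rng.uniform(1, 5, size=(300, 3))
y = (X.mean(axis=1) + rng.normal(scale=0.3, size=300) >= 3.0).astype(int)

X_train, X_val = X[:220], X[220:]
y_train, y_val = y[:220], y[220:]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_val_scaled = scaler.transform(X_val)          # reuse the training statistics

model = LogisticRegressionCV(cv=5)
model.fit(X_train_scaled, y_train)
val_accuracy = model.score(X_val_scaled, y_val)

# Coefficients on scaled features are comparable across columns.
coefficients = pd.Series(model.coef_[0], index=feature_names).sort_values()
print(f"Validation accuracy: {val_accuracy:.3f}")
print(coefficients)
```

In the notebook itself, the analogous calls would use `x_train_imputed` and `x_validate_imputed` with `y_train` and a validation target taken from `x_validate['Great']`. The test set would be scored once, at the end.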