<a href="https://colab.research.google.com/github/sikoh/DS-Linear-Models/blob/main/Logistic-Regression/DS_LogisticRegression_Homework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module Project: Logistic Regression

Do you like burritos? 🌯 You're in luck then, because in this project you'll create a model to predict whether a burrito is `'Great'`.

The dataset for this assignment comes from [Scott Cole](https://srcole.github.io/100burritos/), a San Diego-based data scientist and burrito enthusiast.

## Directions

The tasks for this project are the following:

- **Task 1:** Import `csv` file using `wrangle` function.
- **Task 2:** Conduct exploratory data analysis (EDA), and modify `wrangle` function .
- **Task 3:** Split data into feature matrix `X` and target vector `y`.
- **Task 4:** Split feature matrix `X` and target vector `y` into training and test sets.
- **Task 5:** Establish the baseline accuracy score for your dataset.
- **Task 6:** Build `model_logr` using a pipeline that includes three transfomers and `LogisticRegression` predictor. Train model on `X_train` and `X_test`.
- **Task 7:** Calculate the training and test accuracy score for your model.
- **Task 8:** Create a horizontal bar chart showing the 10 most influencial features for your  model.
- **Task 9:** Demonstrate and explain the differences between `model_lr.predict()` and `model_lr.predict_proba()`.

**Note**

You should limit yourself to the following libraries:

- `category_encoders`
- `matplotlib`
- `pandas`
- `sklearn`

# I. Wrangle Data

In [None]:
!pip install category_encoders

In [52]:
from category_encoders import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np


In [5]:
def wrangle(filepath):
    # Import w/ DateTimeIndex
    df = pd.read_csv(filepath, parse_dates=['Date'],
                     index_col='Date')

    # Drop unrated burritos
    df.dropna(subset=['overall'], inplace=True)

    # Derive binary classification target:
    # We define a 'Great' burrito as having an
    # overall rating of 4 or higher, on a 5 point scale
    df['Great'] = (df['overall'] >= 4).astype(int)

    # Drop high cardinality categoricals
    df = df.drop(columns=['Notes', 'Location', 'Address', 'URL', 'Neighborhood'])

    # Drop columns to prevent "leakage"
    df = df.drop(columns=['Rec', 'overall'])

    return df


**Task 1:** Use the above `wrangle` function to import the `burritos.csv` file into a DataFrame named `df`.

In [6]:
filepath = "https://raw.githubusercontent.com/bloominstituteoftechnology/DS-Unit-2-Linear-Models/master/data/burritos/burritos.csv"
df = wrangle(filepath)
df

Unnamed: 0_level_0,Burrito,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,...,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2016-01-18,California,3.5,4.2,,6.49,3.0,,,,,...,,,,,,,,,,0
2016-01-24,California,3.5,3.3,,5.45,3.5,,,,,...,,,,,,,,,,0
2016-01-24,Carnitas,,,,4.85,1.5,,,,,...,,,,,,,,,,0
2016-01-24,Carne asada,,,,5.25,2.0,,,,,...,,,,,,,,,,0
2016-01-27,California,4.0,3.8,x,6.59,4.0,,,,,...,,,,,,,,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2019-08-27,Al Pastor,,,,6.00,1.0,,,17.0,20.5,...,,,,,,,,,,0
2019-08-27,Chile Relleno,,,,6.00,4.0,,,19.0,26.0,...,,,,,,,,,,1
2019-08-27,California,,,,7.90,3.0,,,20.0,22.0,...,,,,,,,,,,0
2019-08-27,Shrimp,,,,7.90,3.0,,,22.5,24.5,...,,,,,,,,,,1


In [7]:
df.isnull().sum()

Burrito             0
Yelp              334
Google            334
Chips             395
Cost                7
Hunger              3
Mass (g)          399
Density (g/mL)    399
Length            138
Circum            140
Volume            140
Tortilla            0
Temp               20
Meat               14
Fillings            3
Meat:filling        9
Uniformity          2
Salsa              25
Synergy             2
Wrap                3
Reviewer            1
Unreliable        388
NonSD             414
Beef              242
Pico              263
Guac              267
Cheese            262
Fries             294
Sour cream        329
Pork              370
Chicken           400
Shrimp            400
Fish              415
Rice              385
Beans             386
Lettuce           410
Tomato            414
Bell peper        414
Carrots           420
Cabbage           413
Sauce             383
Salsa.1           414
Cilantro          406
Onion             404
Taquito           417
Pineapple 

In [8]:
df.dtypes

Burrito            object
Yelp              float64
Google            float64
Chips              object
Cost              float64
Hunger            float64
Mass (g)          float64
Density (g/mL)    float64
Length            float64
Circum            float64
Volume            float64
Tortilla          float64
Temp              float64
Meat              float64
Fillings          float64
Meat:filling      float64
Uniformity        float64
Salsa             float64
Synergy           float64
Wrap              float64
Reviewer           object
Unreliable         object
NonSD              object
Beef               object
Pico               object
Guac               object
Cheese             object
Fries              object
Sour cream         object
Pork               object
Chicken            object
Shrimp             object
Fish               object
Rice               object
Beans              object
Lettuce            object
Tomato             object
Bell peper         object
Carrots     

During your exploratory data analysis, note that there are several columns whose data type is `object` but that seem to be a binary encoding. For example, `df['Beef'].head()` returns:

```
0      x
1      x
2    NaN
3      x
4      x
Name: Beef, dtype: object
```

**Task 2:** Change the `wrangle` function so that these columns are properly encoded as `0` and `1`s. Be sure your code handles upper- and lowercase `X`s, and `NaN`s.

In [None]:
df['Beef'].head()

Date
2016-01-18      x
2016-01-24      x
2016-01-24    NaN
2016-01-24      x
2016-01-27      x
Name: Beef, dtype: object

In [9]:
df.describe(exclude= 'number')

Unnamed: 0,Burrito,Chips,Reviewer,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,...,Chile relleno,Nopales,Lobster,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini
count,421,26,420,33,7,179,158,154,159,127,...,4,4,1,5,3,3,2,13,3,1
unique,132,4,106,1,2,2,2,2,2,2,...,1,1,1,1,1,1,1,1,2,1
top,California,x,Scott,x,x,x,x,x,x,x,...,x,x,x,x,x,x,x,x,x,x
freq,101,21,147,33,5,137,127,114,128,102,...,4,4,1,5,3,3,2,13,2,1


In [None]:
# Conduct your exploratory data analysis here
# And modify the `wrangle` function above.

In [29]:
# adding _ to the name of the modified function
def wrangle_(filepath):
    # Import w/ DateTimeIndex
    df = pd.read_csv(filepath, parse_dates=['Date'],
                     index_col='Date')

    # Drop unrated burritos
    df.dropna(subset=['overall'], inplace=True)

    # Derive binary classification target:
    # We define a 'Great' burrito as having an
    # overall rating of 4 or higher, on a 5 point scale
    df['Great'] = (df['overall'] >= 4).astype(int)

    # Drop high cardinality categoricals
    df = df.drop(columns=['Notes', 'Location', 'Address', 'URL', 'Neighborhood'])

    # Drop columns to prevent "leakage"
    df = df.drop(columns=['Rec', 'overall'])


    # Columns are properly encoded as 0 and 1s. Code handles upper- and lowercase Xs, and NaNs.
    for col in df.columns:
      if df[col].dtype == 'object' and df[col].str.contains('[Xx]').any():
        df[col] = df[col].replace({'X': 1,'x': 1, np.NaN: 0})

    return df

In [32]:
df = wrangle_(filepath)
df['Beef'].head()

Date
2016-01-18    1.0
2016-01-24    1.0
2016-01-24    0.0
2016-01-24    1.0
2016-01-27    1.0
Name: Beef, dtype: float64

If you explore the `'Burrito'` column of `df`, you'll notice that it's a high-cardinality categorical feature. You'll also notice that there's a lot of overlap between the categories.

**Stretch Goal:** Change the `wrangle` function above so that it engineers four new features: `'california'`, `'asada'`, `'surf'`, and `'carnitas'`. Each row should have a `1` or `0` based on the text information in the `'Burrito'` column. For example, here's how the first 5 rows of the dataset would look.

| **Burrito** | **california** | **asada** | **surf** | **carnitas** |
| :---------- | :------------: | :-------: | :------: | :----------: |
| California  |       1        |     0     |    0     |      0       |
| California  |       1        |     0     |    0     |      0       |
|  Carnitas   |       0        |     0     |    0     |      1       |
| Carne asada |       0        |     1     |    0     |      0       |
| California  |       1        |     0     |    0     |      0       |

**Note:** Be sure to also drop the `'Burrito'` once you've engineered your new features.

In [None]:
# Conduct your exploratory data analysis here
# And modify the `wrangle` function above.

In [33]:
df['Burrito']

Date
2016-01-18      California 
2016-01-24      California 
2016-01-24         Carnitas
2016-01-24      Carne asada
2016-01-27       California
                  ...      
2019-08-27        Al Pastor
2019-08-27    Chile Relleno
2019-08-27       California
2019-08-27           Shrimp
2019-08-27      Pollo Asado
Name: Burrito, Length: 421, dtype: object

In [36]:
# adding _2 to the name of the modified function
def wrangle_2(filepath):
    # Import w/ DateTimeIndex
    df = pd.read_csv(filepath, parse_dates=['Date'],
                     index_col='Date')

    # Drop unrated burritos
    df.dropna(subset=['overall'], inplace=True)

    # Derive binary classification target:
    # We define a 'Great' burrito as having an
    # overall rating of 4 or higher, on a 5 point scale
    df['Great'] = (df['overall'] >= 4).astype(int)

    # Drop high cardinality categoricals
    df = df.drop(columns=['Notes', 'Location', 'Address', 'URL', 'Neighborhood'])

    # Drop columns to prevent "leakage"
    df = df.drop(columns=['Rec', 'overall'])


    # Columns are properly encoded as 0 and 1s. Code handles upper- and lowercase Xs, and NaNs.
    for col in df.columns:
      if df[col].dtype == 'object' and df[col].str.contains('[Xx]').any():
        df[col] = df[col].replace({'X': 1,'x': 1, np.NaN: 0})


    # Engineer the new features
    df['california'] = df['Burrito'].apply(lambda x: 1 if 'California' in x else 0)
    df['asada'] = df['Burrito'].apply(lambda x: 1 if 'Asada' in x else 0)
    df['surf'] = df['Burrito'].apply(lambda x: 1 if 'Surf' in x else 0)
    df['carnitas'] = df['Burrito'].apply(lambda x: 1 if 'Carnitas' in x else 0)

    # Drop the 'Burrito' column
    df.drop('Burrito', axis=1, inplace=True)

    return df

In [37]:
df = wrangle_2(filepath)

In [38]:
df.head()

Unnamed: 0_level_0,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,...,Bacon,Sushi,Avocado,Corn,Zucchini,Great,california,asada,surf,carnitas
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2016-01-18,3.5,4.2,0,6.49,3.0,,,,,,...,0.0,0.0,0.0,0.0,0.0,0,1,0,0,0
2016-01-24,3.5,3.3,0,5.45,3.5,,,,,,...,0.0,0.0,0.0,0.0,0.0,0,1,0,0,0
2016-01-24,,,0,4.85,1.5,,,,,,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1
2016-01-24,,,0,5.25,2.0,,,,,,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0
2016-01-27,4.0,3.8,1,6.59,4.0,,,,,,...,0.0,0.0,0.0,0.0,0.0,1,1,0,0,0


# II. Split Data

**Task 3:** Split your dataset into the feature matrix `X` and the target vector `y`. You want to predict `'Great'`.

In [40]:
X = df.drop(columns='Great')
y = df['Great']

**Task 4:** Split `X` and `y` into a training set (`X_train`, `y_train`) and a test set (`X_test`, `y_test`).

- Your training set should include data from 2016 through 2017.
- Your test set should include data from 2018 and later.

In [41]:
mask = (df.index >= '2016-01-01') & (df.index < '2018-01-01')

X_train, y_train = X.loc[mask], y.loc[mask]
X_test, y_test = X.loc[~mask], y.loc[~mask]

In [43]:
X_train.shape,y_train.shape, X_test.shape, y_test.shape

((381, 61), (381,), (40, 61), (40,))

# III. Establish Baseline

**Task 5:** Since this is a **classification** problem, you should establish a baseline accuracy score. Figure out what is the majority class in `y_train` and what percentage of your training observations it represents.

In [51]:
baseline_acc = y_train.value_counts(normalize=True).max()

print('Baseline Accuracy Score:', round(baseline_acc*100,2),'%')

Baseline Accuracy Score: 58.27 %


# IV. Build Model

**Task 6:** Build a `Pipeline` named `model_logr`, and fit it to your training data. Your pipeline should include:

- a `OneHotEncoder` transformer for categorical features,
- a `SimpleImputer` transformer to deal with missing values,
- a [`StandarScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) transfomer (which often improves performance in a logistic regression model), and
- a `LogisticRegression` predictor.

In [78]:
model_logr = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'), # Fills NaN values with column Mean
    StandardScaler(), # re-scales all features (mean=0, std=1)
    LogisticRegression()
)

model_logr.fit(X_train,y_train);

# IV. Check Metrics

**Task 7:** Calculate the training and test accuracy score for `model_logr`.

In [75]:
print('Model Training Accuracy:', round(model_logr.score(X_train, y_train),4)*100 ,'%')
print('Model Testing Accuracy:', round(model_logr.score(X_test, y_test),4)*100 ,'%')

Model Training Accuracy: 97.11 %
Model Testing Accuracy: 80.0 %


# V. Communicate Results

**Task 8:** Create a horizontal barchart that plots the 10 most important coefficients for `model_lr`, sorted by absolute value.

**Note:** Since you created your model using a `Pipeline`, you'll need to use the [`named_steps`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) attribute to access the coefficients in your `LogisticRegression` predictor. Be sure to look at the shape of the coefficients array before you combine it with the feature names.

There is more than one way to generate predictions with `model_lr`. For instance, you can use [`predict`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logisticregression) or [`predict_proba`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logisticregression#sklearn.linear_model.LogisticRegression.predict_proba).

**Task 9:** Generate predictions for `X_test` using both `predict` and `predict_proba`. Then below, write a summary of the differences in the output for these two methods. You should answer the following questions:

- What data type do `predict` and `predict_proba` output?
- What are the shapes of their different output?
- What numerical values are in the output?
- What do those numerical values represent?

In [62]:
# Write code here to explore the differences between `predict` and `predict_proba`.
model_logr.predict(X_test)[:10]

array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])

In [63]:
model_logr.predict_proba(X_test)[:10]

array([[1.29943062e-05, 9.99987006e-01],
       [9.99998598e-01, 1.40192870e-06],
       [6.32277693e-05, 9.99936772e-01],
       [1.10667785e-03, 9.98893322e-01],
       [9.50616949e-01, 4.93830513e-02],
       [5.44219735e-05, 9.99945578e-01],
       [9.99404510e-01, 5.95490326e-04],
       [9.80855253e-01, 1.91447469e-02],
       [1.76405296e-02, 9.82359470e-01],
       [5.52594157e-05, 9.99944741e-01]])

**Give your written answer here:**

```


```
