Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

Try to complete an initial model today, because we will spend the rest of the week making model interpretation visualizations.

If you aren't ready to try xgboost and permutation importances with your own dataset today, that's okay. You can practice with another dataset instead, and you may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 8
  - _**[Gradient Boosting Explained](https://www.gormanalysis.com/blog/gradient-boosting-explained/)**_ — Ben Gorman
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html) — Alex Rogozhnikov
  - [How to explain gradient boosting](https://explained.ai/gradient-boosting/) — Terence Parr & Jeremy Howard

In [4]:
!pip install category_encoders==2.*



In [5]:
import pandas as pd
import numpy as np
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

source_file1 = '/content/redacted_sales_data.csv'
df = pd.read_csv(source_file1)
df.tail(3)

Unnamed: 0,Name,Price,Tax,Total Price,Total Paid,Terminal,User,Date
25151,,,,,,,,
25152,,,,,,,,
25153,,928084.05,173078.94,1101162.99,1101162.99,,,


In [6]:
df.head()

Unnamed: 0,Name,Price,Tax,Total Price,Total Paid,Terminal,User,Date
0,,65.16,11.66,76.82,76.82,Register 1 - Retail,E5,01/01/2019 8:34 AM
1,,70.0,12.53,82.53,82.53,Register 1 - Retail,E5,01/01/2019 8:55 AM
2,,16.96,3.04,20.0,20.0,Register 1 - Retail,E5,01/01/2019 8:57 AM
3,,20.0,3.58,23.58,23.58,Register 1 - Retail,E5,01/01/2019 8:59 AM
4,,8.48,1.52,10.0,10.0,Register 1 - Retail,E5,01/01/2019 10:00 AM


In [0]:
# The column names have a trailing space; strip whitespace from every column name
df.columns = df.columns.str.strip()

In [0]:
### drop name column
df = df.drop(['Name'], axis=1)

In [9]:
## Inspect the tail: the last 12 rows (25142 to 25153) are empty or summary rows and need to be dropped
df.tail(13)

Unnamed: 0,Price,Tax,Total Price,Total Paid,Terminal,User,Date
25141,50.27,9.89,60.16,60.16,Register 1 - Retail,E2,06/02/2019 1:48 PM
25142,,,,,,,
25143,,,,,,,
25144,,,,,,,
25145,,,,,,,
25146,,,,,,,
25147,,,,,,,
25148,,,,,,,
25149,,,,,,,
25150,,,,,,,


In [0]:
df = df.dropna(axis=0)

In [11]:
df.tail()

Unnamed: 0,Price,Tax,Total Price,Total Paid,Terminal,User,Date
25137,0.27,2.73,3.0,3.0,Register 1 - Retail,E2,06/02/2019 1:27 PM
25138,139.94,25.05,164.99,164.99,Register 1 - Retail,E7,06/02/2019 1:34 PM
25139,20.0,3.58,23.58,23.58,Register 1 - Retail,E7,06/02/2019 1:36 PM
25140,150.26,26.9,177.16,177.16,12,E2,06/02/2019 1:38 PM
25141,50.27,9.89,60.16,60.16,Register 1 - Retail,E2,06/02/2019 1:48 PM


In [0]:
## drop total paid because it's redundant
## and drop terminal because it's not informative
## rename "User" column for clarity

df = df.drop(['Total Paid'], axis=1)
df = df.drop(['Terminal'], axis=1)
df = df.rename(columns={'Total Price':'Total'})
df = df.rename(columns={'User':'Employee'})

In [13]:
df.head()

Unnamed: 0,Price,Tax,Total,Employee,Date
0,65.16,11.66,76.82,E5,01/01/2019 8:34 AM
1,70.0,12.53,82.53,E5,01/01/2019 8:55 AM
2,16.96,3.04,20.0,E5,01/01/2019 8:57 AM
3,20.0,3.58,23.58,E5,01/01/2019 8:59 AM
4,8.48,1.52,10.0,E5,01/01/2019 10:00 AM


In [0]:
# Time to choose a target

In [15]:
df['Total'].describe()

count    25142.000000
mean        43.797748
std         40.177104
min          0.010000
25%         18.000000
50%         30.010000
75%         55.990000
max        517.480000
Name: Total, dtype: float64

In [16]:
df.isna().sum()

Price       0
Tax         0
Total       0
Employee    0
Date        0
dtype: int64

In [0]:
# Target: True when a sale's total is at or above the mean total
df["Above ATP"] = df["Total"] >= df["Total"].mean()

In [0]:

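# Parse the timestamp, then extract calendar features from it
df['Date'] = pd.to_datetime(df['Date'])
# Series.dt.week is deprecated (removed in pandas 2.0); isocalendar().week replaces it
df['Week'] = df['Date'].dt.isocalendar().week.astype('int64')
df['Day'] = df['Date'].dt.day
df['Hour'] = df['Date'].dt.hour
df['Minute'] = df['Date'].dt.minute
df['Second'] = df['Date'].dt.second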

In [0]:
df['Month'] = df['Date'].dt.month

In [0]:
# Drop the raw Date column now that its parts have been extracted
df = df.drop(columns=['Date'])

In [21]:
df.dtypes

Price        float64
Tax          float64
Total        float64
Employee      object
Above ATP       bool
Week           int64
Day            int64
Hour           int64
Minute         int64
Second         int64
Month          int64
dtype: object

In [22]:
df.nunique().value_counts()

31      1
12      1
2092    1
10      1
6       1
22      1
4004    1
60      1
2       1
1       1
3984    1
dtype: int64

In [23]:
df['Month'].value_counts()

5    5609
4    5457
3    5357
2    4322
1    4112
6     285
Name: Month, dtype: int64

In [0]:
train = df[df['Month'] <= 3]
val = df[df['Month'] == 4]
test = df[df['Month'] >= 5]
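
The assignment asks whether the model beats the baseline, so it helps to compute that baseline right after the split. The following is a minimal sketch, not the notebook's own result: it assumes a small invented frame (`toy`, with made-up values) in place of the sales data, predicts the majority class of the training months for every validation row, and reports the accuracy.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

# Toy stand-in for the sales data; the values are invented for illustration.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'Total': rng.gamma(shape=1.5, scale=30.0, size=600).round(2),
    'Month': rng.integers(1, 7, size=600),  # months 1 through 6
})
toy['Above ATP'] = toy['Total'] >= toy['Total'].mean()

# Same time-based split as in the notebook: early months train, April validates.
toy_train = toy[toy['Month'] <= 3]
toy_val = toy[toy['Month'] == 4]

# Majority-class baseline: always predict the most common training label.
majority_class = toy_train['Above ATP'].mode()[0]
baseline_pred = np.full(len(toy_val), majority_class)
baseline_acc = accuracy_score(toy_val['Above ATP'], baseline_pred)

print(f'Majority class: {majority_class}')
print(f'Baseline validation accuracy: {baseline_acc:.3f}')
```

Swapping `toy_train` and `toy_val` for the real `train` and `val` frames should give the number a fitted model needs to beat on the validation set.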

In [26]:
# 'Above ATP' is the target
target = 'Above ATP'

# Use every remaining column as a feature (Date was dropped earlier).
# Note: Price, Tax and Total stay in the feature set even though the target is derived from Total.
features = train.columns.drop([target])

print(features)

Index(['Price', 'Tax', 'Total', 'Employee', 'Week', 'Day', 'Hour', 'Minute',
       'Second', 'Month'],
      dtype='object')


In [0]:
# Split each subset into a feature matrix and a target vector
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]
y_test = test[target]

In [0]:
transformers = make_pipeline(
    ce.ordinal.OrdinalEncoder(),
    SimpleImputer()
)

# Fit the transformers on train only, then apply them to the validation set
X_train_transformed = transformers.fit_transform(X_train)
X_val_transformed = transformers.transform(X_val)

model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train_transformed, y_train)
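
The assignment also asks for a gradient boosting model. The sketch below is one way that step might look, under stated assumptions: it uses scikit-learn's `GradientBoostingClassifier` as a stand-in so it runs without extra installs, and it trains on an invented `toy` frame whose target rule is hypothetical. If xgboost is installed, its `XGBClassifier` follows the scikit-learn estimator interface and could take the classifier's place in the same pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in data; column names mirror the notebook, values are invented.
rng = np.random.default_rng(0)
n = 800
toy = pd.DataFrame({
    'Employee': rng.choice(['E2', 'E5', 'E7'], size=n),
    'Hour': rng.integers(8, 20, size=n),
    'Day': rng.integers(1, 29, size=n),
    'Month': rng.integers(1, 5, size=n),  # months 1 through 4
})
# Hypothetical target rule: later hours are more often "above average".
toy['Above ATP'] = (toy['Hour'] + rng.normal(0, 2, size=n)) > 14

toy_features = ['Employee', 'Hour', 'Day']
toy_train = toy[toy['Month'] <= 3]
toy_val = toy[toy['Month'] == 4]

# Ordinal-encode the categorical column and pass numeric columns through.
encoder = make_column_transformer(
    (OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), ['Employee']),
    remainder='passthrough',
)
boosting = make_pipeline(
    encoder,
    GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                               max_depth=3, random_state=42),
)
boosting.fit(toy_train[toy_features], toy_train['Above ATP'])
boosting_acc = boosting.score(toy_val[toy_features], toy_val['Above ATP'])
print(f'Gradient boosting validation accuracy: {boosting_acc:.3f}')
```

The encoder here is scikit-learn's `OrdinalEncoder` rather than the `category_encoders` one used elsewhere in the notebook; that swap only keeps the sketch self-contained.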

In [33]:

# Drop-column importance: compare validation accuracy with and without one column.
# The original cell never set `column`; 'Employee' is chosen here only as an example.
column = 'Employee'

# Fit without column
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)
pipeline.fit(X_train.drop(columns=column), y_train)
score_without = pipeline.score(X_val.drop(columns=column), y_val)
print(f'Validation Accuracy without {column}: {score_without}')

# Fit with column
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)
pipeline.fit(X_train, y_train)
score_with = pipeline.score(X_val, y_val)
print(f'Validation Accuracy with {column}: {score_with}')

# Compare the accuracy with & without column
print(f'Drop-Column Importance for {column}: {score_with - score_without}')
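
Drop-column importance can be repeated for every feature by refitting once per dropped column. The sketch below illustrates the idea on invented data, where only `Hour` carries signal by construction; `fit_and_score` and the `toy` frame are hypothetical names introduced here, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy data: the target depends on Hour only, so Hour should rank first.
rng = np.random.default_rng(1)
n = 600
toy = pd.DataFrame({
    'Hour': rng.integers(8, 20, size=n),
    'Minute': rng.integers(0, 60, size=n),
    'Day': rng.integers(1, 29, size=n),
})
toy['target'] = toy['Hour'] >= 14
toy_train, toy_val = toy.iloc[:400], toy.iloc[400:]
toy_features = ['Hour', 'Minute', 'Day']

def fit_and_score(columns):
    """Fit a fresh forest on the given columns and return validation accuracy."""
    forest = RandomForestClassifier(n_estimators=50, random_state=42)
    forest.fit(toy_train[columns], toy_train['target'])
    return forest.score(toy_val[columns], toy_val['target'])

score_with_all = fit_and_score(toy_features)

# Importance of a column = accuracy with all columns minus accuracy without it.
drop_importances = {
    col: score_with_all - fit_and_score([c for c in toy_features if c != col])
    for col in toy_features
}

for col, importance in sorted(drop_importances.items(),
                              key=lambda item: item[1], reverse=True):
    print(f'{col}: {importance:+.3f}')
```

Because every column needs its own refit, this gets slow on wide datasets, which is the usual reason to reach for permutation importance instead.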

In [0]:
'''import eli5
from eli5.sklearn import PermutationImportance

permuter = PermutationImportance(
    model,
    scoring='accuracy',
    n_iter=5,
    random_state=42
)

permuter.fit(X_val_transformed, y_val)'''
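
The eli5 cell above is disabled. As an alternative, scikit-learn ships `sklearn.inspection.permutation_importance`, which shuffles one column at a time on held-out data and measures the drop in score without refitting. The sketch below assumes invented data (the `toy` frame and its target rule are hypothetical) and uses `n_repeats=5` to mirror the `n_iter=5` setting in the eli5 call.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Toy data: the target depends on Hour only, so Hour should rank first.
rng = np.random.default_rng(7)
n = 600
toy = pd.DataFrame({
    'Hour': rng.integers(8, 20, size=n),
    'Minute': rng.integers(0, 60, size=n),
    'Day': rng.integers(1, 29, size=n),
})
toy_target = toy['Hour'] >= 14

X_tr, X_va = toy.iloc[:400], toy.iloc[400:]
y_tr, y_va = toy_target.iloc[:400], toy_target.iloc[400:]

forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X_tr, y_tr)

# Shuffle each validation column 5 times and record the mean accuracy drop.
result = permutation_importance(
    forest, X_va, y_va,
    scoring='accuracy',
    n_repeats=5,
    random_state=42,
)
importances = pd.Series(result.importances_mean, index=X_va.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

With the notebook's own objects, the call would take the fitted `model`, `X_val_transformed` and `y_val` in place of the toy names.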