Lambda School Data Science

*Unit 2, Sprint 3, Module 2*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.
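
The "does it beat your baseline?" check above can be sketched for a regression target as follows. This is a minimal illustration on synthetic data, not the project dataset: the names `X_demo`, `y_demo`, `baseline`, and `model_demo` are made up for the example, and mean absolute error is only one reasonable choice of metric.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a project dataset (assumption: numeric features, numeric target)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=1000)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.1, random_state=42)

# Baseline: always predict the mean of the training target
baseline = DummyRegressor(strategy='mean').fit(X_tr, y_tr)
baseline_mae = mean_absolute_error(y_va, baseline.predict(X_va))

# Candidate model, scored on the same validation set
model_demo = LinearRegression().fit(X_tr, y_tr)
model_mae = mean_absolute_error(y_va, model_demo.predict(X_va))

print(f'Baseline MAE: {baseline_mae:.3f}')
print(f'Model MAE:    {model_mae:.3f}')
```

A model is only worth interpreting if its validation error is clearly below the baseline's.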

You should try to complete an initial model today, because the rest of the week, we're making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - _**[A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)**_
  - [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
  - _**[Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (2.5 minute video)**_

# Permutation


In [None]:
import pandas as pd
df = pd.read_csv('yelp30kuser.csv')


In [None]:
# Remind ourselves what we are trying to predict
import seaborn as sns
y = df['review_count']
# sns.distplot is deprecated; histplot with kde=True draws a similar plot
sns.histplot(y, kde=True);

In [None]:
def engineer(X):
    """A function to engineer the training, validation and test datasets in the same way"""
    # Making a copy as not to modify the original dataset 
    X = X.copy()
    
    # Format this column into a datetime type to extract year, month, and day 
    X['yelping_since'] = pd.to_datetime(X['yelping_since'])
    X['user_created_year'] = X['yelping_since'].dt.year
    X['user_created_month'] = X['yelping_since'].dt.month
    X['user_created_day']= X['yelping_since'].dt.day 
    X = X.drop(columns='yelping_since') # drop original 
    
    
    # these columns were found through permutation importances for random forest
    remove = ['fans', 'elite', 'compliment_writer',
              'name', 'compliment_photos', 'user_created_year']
    
    
    X = X.drop(columns=remove)
    # Convert this column from a float into an integer value
    # Since floats cannot be used as targets in a model 
#     X["target_star"] = X['average_stars'].astype(int)
    
    # X['review_count_bin'] = pd.qcut(X['review_count'], q=10, duplicates='drop')
    
    # There's no spaces in the column names but this code might be useful anyway                                
    X.columns = [col.replace(' ', '_') for col in X]
    
    return X

In [None]:
# Engineer the data to work with plots 
import plotly.express as px
df_engineered = engineer(df)
df_engineered.dtypes

In [None]:
# Split the dataframe into training and validation sets 
from sklearn.model_selection import train_test_split
training, validation = train_test_split(df, test_size =0.1, shuffle=True, random_state=42)
training.shape, validation.shape

In [None]:

# Engineer and separate X and y 
train = engineer(training)
val = engineer(validation)

target = "review_count"

X_train = train.drop(columns=target)
y_train = train[target]

X_val = val.drop(columns=target)
y_val = val[target]

In [None]:
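import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Encode and impute inside a pipeline so the raw X_train / X_val can be used directly
model = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    LinearRegression(n_jobs=-1)
)

# Fit on train, score on val. For a regressor, .score() returns R^2, not accuracy
model.fit(X_train, y_train)
print('Training R^2', model.score(X_train, y_train))
print('Validation R^2', model.score(X_val, y_val))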

In [None]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

column = 'average_stars'

# Fit without column
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    LinearRegression()
)
pipeline.fit(X_train.drop(columns=column), y_train)
score_without = pipeline.score(X_val.drop(columns=column), y_val)
print(f'Validation R^2 without {column}: {score_without}')

# Fit with column
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    LinearRegression()
)
pipeline.fit(X_train, y_train)
score_with = pipeline.score(X_val, y_val)
print(f'Validation R^2 with {column}: {score_with}')

# Compare the score with & without column
print(f'Drop-Column Importance for {column}: {score_with - score_without}')
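
The drop-column comparison above can be wrapped in a small helper that repeats it for every column. The sketch below is one way to do that, using synthetic NumPy data rather than the Yelp dataframe; `drop_column_importances`, `X_demo`, and `y_demo` are hypothetical names introduced only for this illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

def drop_column_importances(estimator, X_tr, y_tr, X_va, y_va):
    """Hypothetical helper: validation score lost when each column is left out."""
    # Fit once with every column; this score does not depend on the dropped column
    score_with = clone(estimator).fit(X_tr, y_tr).score(X_va, y_va)
    importances = []
    for j in range(X_tr.shape[1]):
        keep = [k for k in range(X_tr.shape[1]) if k != j]
        # Refit from scratch without column j and score on the same validation rows
        score_without = clone(estimator).fit(X_tr[:, keep], y_tr).score(X_va[:, keep], y_va)
        importances.append(score_with - score_without)
    return np.array(importances)

# Synthetic data: column 0 matters a lot, column 1 a little, column 2 not at all
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 3))
y_demo = 4.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=600)

drop_importances = drop_column_importances(
    LinearRegression(), X_demo[:500], y_demo[:500], X_demo[500:], y_demo[500:]
)
print(drop_importances)
```

Because every column requires a full refit, this gets expensive quickly, which is one motivation for permutation importance.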

In [None]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='median'), 
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)

# Fit on train, score on val
pipeline.fit(X_train, y_train)
print('Random Forest Validation Accuracy', pipeline.score(X_val, y_val))

In [None]:
train['Unnamed:_0'].head()

In [None]:
# Get feature importances
rf = pipeline.named_steps['randomforestclassifier']
importances = pd.Series(rf.feature_importances_, X_train.columns)

# Plot feature importances
%matplotlib inline
import matplotlib.pyplot as plt

n = 20
plt.figure(figsize=(10,n/2))
plt.title(f'Top {n} features')
importances.sort_values()[-n:].plot.barh(color='grey');

In [None]:

# The score with all columns does not depend on which column is dropped,
# so fit that pipeline once, outside the loop
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)
pipeline.fit(X_train, y_train)
score_with = pipeline.score(X_val, y_val)
print(f'Validation Accuracy with all columns: {score_with}')

for column in X_train.columns:
    # Fit without column
    pipeline = make_pipeline(
        ce.OrdinalEncoder(),
        SimpleImputer(strategy='median'),
        RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
    )
    pipeline.fit(X_train.drop(columns=column), y_train)
    score_without = pipeline.score(X_val.drop(columns=column), y_val)
    print(f'Validation Accuracy without {column}: {score_without}')

    # Compare the score with & without column
    print(f'Drop-Column Importance for {column}: {score_with - score_without}')

In [None]:
# Keep a subset of features. 'name', 'elite' and 'compliment_writer' are already
# dropped by engineer(), so they cannot be selected here.
select = ['Unnamed:_0', 'useful', 'average_stars', 'compliment_cute',
          'compliment_plain', 'compliment_funny',
          'user_created_month', 'user_created_day']
X_train, X_val = X_train[select], X_val[select]

In [None]:
# Prepare for Permutation importances 
from sklearn.pipeline import make_pipeline
import category_encoders as ce
from sklearn.impute import SimpleImputer

transformers = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median')
)

X_train_transformed = transformers.fit_transform(X_train)
X_val_transformed = transformers.transform(X_val)


In [None]:
X_train.shape, X_val.shape

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression(n_jobs=-1)

# Refit on the selected features: no difference in the score
model.fit(X_train_transformed, y_train)

# For a regressor, .score() returns R^2, not accuracy
print('Training R^2', model.score(X_train_transformed, y_train))
print('Validation R^2', model.score(X_val_transformed, y_val))

In [None]:
model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train_transformed, y_train)

In [None]:
import eli5
from eli5.sklearn import PermutationImportance

permuter = PermutationImportance(
    model, 
    scoring='accuracy', 
    n_iter=5, 
    random_state=42
)

permuter.fit(X_val_transformed, y_val)
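
As a rough sketch of the idea behind permutation importance (not a reproduction of eli5's implementation): shuffle one column of the validation data, re-score the already fitted model, and record how much the score drops. The helper name `permutation_importance_manual` and the synthetic data below are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def permutation_importance_manual(model, X, y, n_iter=5, seed=42):
    """Hypothetical helper: mean drop in model.score when each column of X is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_iter):
            X_permuted = X.copy()
            # Shuffle one column to break its relationship with the target
            X_permuted[:, j] = rng.permutation(X_permuted[:, j])
            drops.append(baseline - model.score(X_permuted, y))
        importances[j] = np.mean(drops)
    return importances

# Synthetic data: only column 0 drives the target; column 1 is pure noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
y_demo = 3 * X_demo[:, 0] + rng.normal(scale=0.1, size=500)

# Fit on the first 400 rows, measure importances on the held-out 100 rows
demo_model = RandomForestRegressor(n_estimators=20, random_state=0)
demo_model.fit(X_demo[:400], y_demo[:400])
demo_importances = permutation_importance_manual(demo_model, X_demo[400:], y_demo[400:])
print(demo_importances)
```

Unlike drop-column importance, the model is fitted only once; only the scoring step is repeated.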

In [None]:
feature_names = X_val.columns.tolist()
pd.Series(permuter.feature_importances_, feature_names).sort_values(ascending=False)

In [None]:
feature_names = X_val.columns.tolist()

eli5.show_weights(
    permuter,
    top=None, # show permutation importances for all features
    feature_names=feature_names
)

In [None]:
print('Shape before removing features:', X_train.shape)

In [None]:
minimum_importance = 0
mask = permuter.feature_importances_ > minimum_importance
features = X_train.columns[mask]
X_train = X_train[features]

In [None]:
X_val = X_val[features]

pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='median'), 
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)

# Fit on train, score on val
pipeline.fit(X_train, y_train)

# Dropping these columns improves the random forest score.
# Dropping the same columns did not improve the linear regression score.
print('Validation Accuracy', pipeline.score(X_val, y_val))

# The Random Forest score improved after dropping the columns with zero or negative importance.
When I tried dropping the same columns from the Linear Regression, the score didn't improve. That makes sense intuitively, since a straight line isn't going to be affected by small differences the way a decision tree is.

In [None]:
from xgboost import XGBClassifier

pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    XGBClassifier(n_estimators=10, random_state=42, n_jobs=-1)
)

pipeline.fit(X_train, y_train)

from sklearn.metrics import accuracy_score
y_pred = pipeline.predict(X_val)
print('Validation Accuracy', accuracy_score(y_val, y_pred))

# Gradient Boosting is significantly more accurate than Random Forest 

In [1]:
# Split the dataframe into training and validation sets 
import pandas as pd
df = pd.read_csv('yelp30kuser.csv')

def engineer(X):
    """A function to engineer the training, validation and test datasets in the same way"""
    # Making a copy as not to modify the original dataset 
    X = X.copy()
    
    # Format this column into a datetime type to extract year, month, and day 
    X['yelping_since'] = pd.to_datetime(X['yelping_since'])
    X['user_created_year'] = X['yelping_since'].dt.year
    X['user_created_month'] = X['yelping_since'].dt.month
    X['user_created_day']= X['yelping_since'].dt.day 
    X = X.drop(columns='yelping_since') # drop original 
    
    
    # these columns were found through permutation importances for random forest
    remove = ['fans', 'elite', 'compliment_writer',
              'name', 'compliment_photos', 'user_created_year']
    
    
    X = X.drop(columns=remove)
    # Convert this column from a float into an integer value
    # Since floats cannot be used as targets in a model 
#     X["target_star"] = X['average_stars'].astype(int)
    
    # X['review_count_bin'] = pd.qcut(X['review_count'], q=10, duplicates='drop')
    
    # There's no spaces in the column names but this code might be useful anyway                                
    X.columns = [col.replace(' ', '_') for col in X]
    
    return X

from sklearn.model_selection import train_test_split
training, validation = train_test_split(df, test_size =0.1, shuffle=True, random_state=42)
training.shape, validation.shape


((27000, 23), (3000, 23))

In [2]:
# Engineer and separate X and y 
train = engineer(training)
val = engineer(validation)

target = "review_count"

X_train = train.drop(columns=target)
y_train = train[target]

X_val = val.drop(columns=target)
y_val = val[target]




In [6]:
import category_encoders as ce
from xgboost import XGBClassifier

encoder = ce.OrdinalEncoder()
X_train_encoded = encoder.fit_transform(X_train)
X_val_encoded = encoder.transform(X_val)  # transform only; the encoder is fit on train

model = XGBClassifier(
    n_estimators=1000, # <= 1000 trees, depends on early stopping
    max_depth=7,       # try deeper trees because of high cardinality categoricals
    learning_rate=0.5, # try higher learning rate
    n_jobs=-1
)

# Each eval_set entry pairs a feature matrix with its own target
eval_set = [(X_train_encoded, y_train),
            (X_val_encoded, y_val)]

model.fit(X_train_encoded, y_train,
          eval_set=eval_set,
          eval_metric='merror',
          early_stopping_rounds=20)

ValueError: y contains previously unseen labels: [236, 237, 268, 289, 314, 330, 590, 620, 729, 1295, 1364]
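
One plausible reading of this error, assuming the installed XGBoost version label-encodes `y` inside `XGBClassifier.fit` (an assumption, not verified here): `review_count` is a count, so the validation target can contain values that never occur in the training target, and a classifier treats each distinct value as its own class. The sketch below reproduces the same kind of failure with scikit-learn's `LabelEncoder` as a stand-in; the arrays are made-up toy values.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy targets (made up): 236 appears in validation but never in training
y_train_demo = np.array([1, 2, 2, 5, 9])
y_val_demo = np.array([2, 5, 236])

# The encoder only knows the classes it saw during fit
label_encoder = LabelEncoder().fit(y_train_demo)

try:
    label_encoder.transform(y_val_demo)
    error_message = None
except ValueError as err:
    error_message = str(err)

# Values present in validation but absent from training
unseen = np.setdiff1d(y_val_demo, y_train_demo).tolist()

print(error_message)
print('Unseen labels:', unseen)
```

If that reading is right, treating `review_count` as a regression target (or binning it into a small number of classes before splitting) would avoid the problem.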