Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70% ? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
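
For a right-skewed regression target, the log transform mentioned in the checklist can be sketched as follows. The values are made up for illustration, and `np.log1p` / `np.expm1` are one common choice for the transform and its inverse, not the only one.

```python
import numpy as np

# Hypothetical right-skewed regression target; not taken from the dataset.
y = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 1000.0])

# log1p computes log(1 + y): it is defined at y = 0 and compresses large values.
y_log = np.log1p(y)

# expm1 is the inverse, used to map predictions back to the original units.
y_back = np.expm1(y_log)

print(y_log.round(3))
print(np.allclose(y, y_back))
```

A model would be fit on `y_log`, and its predictions passed through `np.expm1` before computing errors in the original units.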

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

In [17]:
%%capture
import sys
if 'google.colab' in sys.modules:
  !pip install category_encoders==2.*
  !pip install eli5

In [1]:
from google.colab import files
uploaded = files.upload()

Saving master.csv to master.csv


In [2]:
import pandas as pd 
import io
df = pd.read_csv(io.BytesIO(uploaded['master.csv']))
df.head()

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers


In [None]:
df.shape

(27820, 12)

My target is `sex`, which takes two values: male or female.

This is a classification problem, because the target is categorical rather than continuous.

In [3]:
# My classification problem has 2 classes
y = df['sex']
y.nunique()

2

In [None]:
y.value_counts(normalize=True)

male      0.5
female    0.5
Name: sex, dtype: float64

**It is worth noting I may want to try a classification for the generations, or linear regression for percentages vs population, gdp, and age**

For a classification problem, I may use a **confusion matrix** or **precision and recall**.

However, because the two classes are evenly balanced (50% each), an **accuracy score** should also be a reliable metric.
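
As a rough sketch of how these metrics relate, the snippet below computes a confusion matrix, precision, recall, and accuracy with scikit-learn. The labels and predictions are hypothetical, and treating `'male'` as the positive class is an arbitrary choice for illustration.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical true labels and predictions, for illustration only.
y_true = ['male', 'male', 'male', 'female', 'female', 'female']
y_hat = ['male', 'male', 'female', 'female', 'female', 'male']

labels = ['female', 'male']
# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm)

# Treat 'male' as the positive class for precision and recall.
precision = precision_score(y_true, y_hat, pos_label='male')
recall = recall_score(y_true, y_hat, pos_label='male')
accuracy = accuracy_score(y_true, y_hat)
print(precision, recall, accuracy)
```

With balanced classes, accuracy summarizes the diagonal of the confusion matrix well; precision and recall become more informative when one class dominates.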

In [None]:
# I think it may be interesting to train / validate / test based on generation 
# Boomers alone could easily be the validation set...
# We will need to do some research on the chronological order of the generations

df['generation'].value_counts()

Generation X       6408
Silent             6364
Millenials         5844
Boomers            4990
G.I. Generation    2744
Generation Z       1470
Name: generation, dtype: int64
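
Besides splitting by generation, the assignment also mentions a time-based split. A minimal sketch of one on the `year` column is shown here. The toy frame and the cutoff years (2011 and 2015) are assumptions chosen only for illustration.

```python
import pandas as pd

# Toy frame standing in for the real data; only the 'year' column matters here.
toy = pd.DataFrame({
    'year': [1987, 1995, 2003, 2010, 2013, 2014, 2015, 2016],
    'sex': ['male', 'female'] * 4,
})

# Hypothetical cutoffs: train on the past, validate and test on later years.
train_tb = toy[toy['year'] < 2011]
val_tb = toy[(toy['year'] >= 2011) & (toy['year'] < 2015)]
test_tb = toy[toy['year'] >= 2015]

print(len(train_tb), len(val_tb), len(test_tb))
```

The same masking pattern would work for a split on `generation`, once the generations are put in chronological order.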

In [None]:
# Tired of scrolling...
df.head()

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers


In [4]:
# Cleaning data
# Let's just drop HDI for year, as there's a lot of nans 
df = df.drop(columns='HDI for year')
df.head()

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,gdp_for_year ($),gdp_per_capita ($),generation
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,2156624900,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,2156624900,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,2156624900,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,2156624900,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,2156624900,796,Boomers


Since `country` and `year` are already separate columns, I will not use `country-year` as a feature, and I could likely drop it altogether. I also don't foresee any columns leaking future information about the target.
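
One way to confirm that `country-year` is redundant before dropping it is to rebuild it from `country` and `year` and compare. This is a sketch on toy rows that mirror the columns shown above; the real data may contain formatting differences that this check would reveal.

```python
import pandas as pd

# Toy rows mirroring the columns shown above; not the full dataset.
toy = pd.DataFrame({
    'country': ['Albania', 'Albania'],
    'year': [1987, 1988],
    'country-year': ['Albania1987', 'Albania1988'],
})

# Rebuild the combined key from its parts and compare it to the existing column.
rebuilt = toy['country'] + toy['year'].astype(str)
is_redundant = (rebuilt == toy['country-year']).all()
print(is_redundant)

# If the column adds nothing new, drop it.
if is_redundant:
    toy = toy.drop(columns='country-year')
```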

*Unit 2, Sprint 3, Module 2*

---


# Wrangle ML datasets

- [ ] Continue to clean and explore your data. 
- [ ] For the evaluation metric you chose, what score would you get just by guessing?
- [ ] Can you make a fast, first model that beats guessing?

**We recommend that you use your portfolio project dataset for all assignments this sprint.**

Initially, I will use accuracy as my evaluation metric.

Let's try a first model using a random train / validation / test split.

In [5]:
# Let's create our train, validation, test split 

from sklearn.model_selection import train_test_split
# Let's first split train and test 
train, test = train_test_split(df, test_size=0.20, random_state=42)
# And then split train and validate 
train, val = train_test_split(train, test_size=0.25, random_state=1)

In [None]:
# Checking sizes out of curiosity 

train.shape, val.shape, test.shape
#NOICE

((16692, 11), (5564, 11), (5564, 11))

In [6]:
# Define the majority class 

target = 'sex'
y_train = train[target]
# As seen above, the two classes are almost evenly split, so the majority
# class holds only a slight edge

majority_class = y_train.mode()[0]
y_pred = [majority_class] * len(y_train)

In [7]:
from sklearn.metrics import accuracy_score 
accuracy_score(y_train, y_pred)
# The majority-class baseline scores about 50% accuracy on the training set

0.5008387251377906

In [8]:
# Score the same majority-class baseline on the validation set
y_val = val[target]
y_pred = [majority_class] * len(y_val)
accuracy_score(y_val, y_pred)
# The baseline accuracy on the validation set is about 51%

0.5111430625449317
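
The manual baseline above can also be expressed with scikit-learn's `DummyClassifier`. This is a sketch on hypothetical labels; the variable names with a `toy` infix are made up for this example.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels; DummyClassifier ignores the feature values.
X_toy_train = [[0], [0], [0], [0], [0]]
y_toy_train = ['male', 'male', 'male', 'female', 'female']
X_toy_val = [[0], [0], [0], [0]]
y_toy_val = ['male', 'female', 'female', 'female']

# 'most_frequent' always predicts the majority class seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy_train, y_toy_train)

val_baseline = accuracy_score(y_toy_val, baseline.predict(X_toy_val))
print(val_baseline)
```

As in the cells above, the majority class is learned from the training labels and then scored on the validation labels, so the validation baseline can land below 50% when the class balance differs between the two sets.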

In [None]:
df['generation'].value_counts()

Generation X       6408
Silent             6364
Millenials         5844
Boomers            4990
G.I. Generation    2744
Generation Z       1470
Name: generation, dtype: int64

In [24]:
# As a quick first model, let's fit a logistic regression
# Arrange the X feature matrices (the target vectors are defined above)
features = ['generation', 'age', 'year']
# Based on the permutation importances, I leave out country
X_train = train[features]
X_val = val[features]
y_val = val[target]

# Pipeline
from sklearn.linear_model import LogisticRegression
import category_encoders as ce
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer 

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy = 'mean'),
    LogisticRegression(multi_class='auto', solver='lbfgs')
)

pipeline.fit(X_train, y_train)

print('Validation Accuracy', pipeline.score(X_val, y_val))

# My validation accuracy exactly matches the majority-class baseline (0.5111),
# so the model is likely predicting a single class for every row

Validation Accuracy 0.5111430625449317


In [32]:
%%time
# What if we try a random forest?
from sklearn.ensemble import RandomForestClassifier
treepipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='mean'), 
    RandomForestClassifier(random_state=0, n_jobs=-1)
)

treepipeline.fit(X_train, y_train)

print('Train Accuracy:', treepipeline.score(X_train, y_train))
print('Validation Accuracy', treepipeline.score(X_val, y_val))
# Train accuracy is only slightly above the baseline, and validation accuracy
# (0.46) falls below it, suggesting these features carry little signal about sex

Train Accuracy: 0.524502755811167
Validation Accuracy 0.4618979151689432
CPU times: user 1.38 s, sys: 33.4 ms, total: 1.42 s
Wall time: 948 ms
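
One way to investigate why neither model beats the baseline is to check whether a feature's values differ between the classes at all. The sketch below uses toy rows in which every age group has one male and one female row; that structure is an assumption for illustration and would need to be checked against the real data.

```python
import pandas as pd

# Toy rows where every age group has one male and one female row (an assumption
# for illustration; check whether the real data has this structure).
toy = pd.DataFrame({
    'age': ['15-24 years', '15-24 years', '75+ years', '75+ years'],
    'sex': ['male', 'female', 'male', 'female'],
})

# Share of each sex within each age group; each row sums to 1.
shares = pd.crosstab(toy['age'], toy['sex'], normalize='index')
print(shares)
```

If every row of such a table is close to 0.5 / 0.5, the feature cannot help separate the classes on its own.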


*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because the rest of the week, we're making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.
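
As a rough illustration of the boosting step, the sketch below fits a gradient boosting classifier on synthetic data. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in rather than xgboost itself; xgboost's `XGBClassifier` follows a similar `fit` / `score` pattern. The data and hyperparameters are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data; not the portfolio dataset.
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

# Gradient boosting fits shallow trees in sequence, each one trained to
# correct the errors of the ensemble built so far.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                max_depth=3, random_state=0)
gb.fit(X_tr, y_tr)

val_acc = gb.score(X_va, y_va)
print('Validation accuracy:', val_acc)
```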


In [20]:
# Let's get permutation importances
from sklearn.ensemble import RandomForestClassifier

transformers = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median')
)

X_train_transformed = transformers.fit_transform(X_train)
X_val_transformed = transformers.transform(X_val)

model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train_transformed, y_train)

import eli5
from eli5.sklearn import PermutationImportance

permuter = PermutationImportance(
    model, 
    scoring='accuracy', 
    n_iter=5, 
    random_state=42
)

permuter.fit(X_val_transformed, y_val)

feature_names = X_val.columns.tolist()
pd.Series(permuter.feature_importances_, feature_names).sort_values()

eli5.show_weights(
    permuter, 
    top=None, # No limit: show permutation importances for all features
    feature_names=feature_names # must be a list
)

Weight,Feature
-0.1045  ± 0.0110,generation
-0.1468  ± 0.0045,age
-0.2818  ± 0.0067,year
-0.2951  ± 0.0183,country


# Model Interpretation

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] Continue to iterate on your project: data cleaning, exploratory visualization, feature engineering, modeling.
- [ ] Make at least 1 partial dependence plot to explain your model.
- [ ] Make at least 1 Shapley force plot to explain an individual prediction.
- [ ] **Share at least 1 visualization (of any type) on Slack!**

If you aren't ready to make these plots with your own dataset, you can practice these objectives with any dataset you've worked with previously. Example solutions are available for Partial Dependence Plots with the Tanzania Waterpumps dataset, and Shapley force plots with the Titanic dataset. (These datasets are available in the data directory of this repository.)

Please be aware that **multi-class classification** will result in multiple Partial Dependence Plots (one for each class), and multiple sets of Shapley Values (one for each class)
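
A minimal sketch of computing partial dependence for a binary classifier with scikit-learn's `partial_dependence` is given here, on synthetic data. The data, the model settings, and the choice of feature 0 are assumptions for illustration. For a binary problem there is a single curve, whereas a multi-class model would return one per class.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 2))
# The label depends only on feature 0.
y = (X[:, 0] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Average predicted probability of the positive class as feature 0 varies
# over a grid of 10 values, with the other feature left as observed.
pd_result = partial_dependence(rf, X, features=[0], grid_resolution=10,
                               kind='average')
avg = pd_result['average'][0]
print(avg.round(2))
```

Plotting `avg` against the grid of feature values gives the partial dependence plot; here the curve should rise as feature 0 increases.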