<a href="https://colab.research.google.com/github/spentaur/DS-Unit-2-Regression-Classification/blob/master/module4/assignment_regression_classification_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 4

## Assignment

- [ ] Watch Aaron Gallant's [video #1](https://www.youtube.com/watch?v=pREaWFli-5I) (12 minutes) & [video #2](https://www.youtube.com/watch?v=bDQgVt4hFgY) (9 minutes) to learn about the mathematics of Logistic Regression.
- [ ] Do train/validate/test split with the Tanzania Waterpumps data.
- [ ] Do one-hot encoding. (Remember it may not work with high cardinality categoricals.)
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your validation accuracy score.
- [ ] Get and plot your coefficients.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.
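
The submission step in the checklist above can be sketched roughly as follows. This is a minimal sketch, not the course's official recipe: the tiny `sample_submission` frame, the `predictions` list, and the column names `id` and `status_group` are stand-ins assumed for illustration, and the file is written to a temporary directory here (in the notebook you would use a plain filename).

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for the real sample_submission.csv and model output
sample_submission = pd.DataFrame({
    'id': [101, 102, 103],
    'status_group': ['functional'] * 3,
})
predictions = ['functional', 'non functional', 'functional needs repair']

# Copy the sample so the id column and the row order are preserved
submission = sample_submission.copy()
submission['status_group'] = predictions

# index=False keeps pandas from writing an extra unnamed index column
out_path = os.path.join(tempfile.gettempdir(), 'submission.csv')
submission.to_csv(out_path, index=False)
```

Starting from the sample file is a cheap way to make sure the uploaded CSV has the expected columns and row count.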

> [Do Not Copy-Paste.](https://docs.google.com/document/d/1ubOw9B3Hfip27hF2ZFnW3a3z9xAgrUDRReOEo-FHCVs/edit) You must type each of these exercises in, manually. If you copy and paste, you might as well not even do them. The point of these exercises is to train your hands, your brain, and your mind in how to read, write, and see code. If you copy-paste, you are cheating yourself out of the effectiveness of the lessons.


## Stretch Goals

### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Clean the data. For ideas, refer to [The Quartz guide to bad data](https://github.com/Quartz/bad-data-guide), a "reference to problems seen in real-world data along with suggestions on how to resolve them." One of the issues is ["Zeros replace missing values."](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values)
- [ ] Make exploratory visualizations.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).


#### Exploratory visualizations

Visualize the relationships between feature(s) and target. I recommend you do this with your training set, after splitting your data. 

For this problem, you may want to create a new column to represent the target as a number, 0 or 1. For example:

```python
train['functional'] = (train['status_group']=='functional').astype(int)
```



You can try [Seaborn "Categorical estimate" plots](https://seaborn.pydata.org/tutorial/categorical.html) for features with reasonably few unique values. (With too many unique values, the plot is unreadable.)

- Categorical features. (If there are too many unique values, you can replace less frequent values with "OTHER.")
- Numeric features. (If there are too many unique values, you can [bin with pandas cut / qcut functions](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=qcut#discretization-and-quantiling).)
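
As a rough illustration of the binning idea for numeric features, here is a small sketch on made-up data. The `train` frame below is synthetic; only the column names `gps_height` and `functional` are borrowed from this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training set
rng = np.random.default_rng(0)
train = pd.DataFrame({
    'gps_height': rng.integers(0, 2000, size=200),
    'functional': rng.integers(0, 2, size=200),
})

# Split the numeric feature into 4 equal-frequency bins
train['gps_height_bin'] = pd.qcut(train['gps_height'], q=4)

# Share of functional pumps within each bin
rate_by_bin = train.groupby('gps_height_bin', observed=True)['functional'].mean()
print(rate_by_bin)
```

With only a handful of bins, the per-bin rates are easy to read as a table or to draw as a bar plot.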

You can try [Seaborn linear model plots](https://seaborn.pydata.org/tutorial/regression.html) with numeric features. For this problem, you may want to use the parameter `logistic=True`.

You do _not_ need to use Seaborn, but it's nice because it includes confidence intervals to visualize uncertainty.

#### High-cardinality categoricals

This code from the previous assignment demonstrates how to replace less frequent values with 'OTHER':

```python
# Reduce cardinality for NEIGHBORHOOD feature ...

# Get a list of the top 10 neighborhoods
top10 = train['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10,
# replace the neighborhood with 'OTHER'
train.loc[~train['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
test.loc[~test['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
```

#### Pipelines

[Scikit-Learn User Guide](https://scikit-learn.org/stable/modules/compose.html) explains why pipelines are useful, and demonstrates how to use them:

> Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves multiple purposes here:
> - **Convenience and encapsulation.** You only have to call fit and predict once on your data to fit a whole sequence of estimators.
> - **Joint parameter selection.** You can grid search over parameters of all estimators in the pipeline at once.
> - **Safety.** Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
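
To make the quoted points concrete, here is a minimal pipeline sketch on toy data. It uses scikit-learn's own `OneHotEncoder` and `ColumnTransformer` rather than the `category_encoders` package used elsewhere in this notebook, and the tiny `X` and `y` are invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: one categorical and one numeric feature
X = pd.DataFrame({
    'quantity': ['enough', 'dry', 'enough', 'seasonal',
                 'dry', 'enough', 'dry', 'seasonal'],
    'gps_height': [1200, 0, 800, 300, 50, 1500, 10, 400],
})
y = pd.Series([1, 0, 1, 1, 0, 1, 0, 0])

# One-hot encode the categorical column, standardize the numeric one
preprocess = ColumnTransformer([
    ('onehot', OneHotEncoder(handle_unknown='ignore'), ['quantity']),
    ('scale', StandardScaler(), ['gps_height']),
])

pipeline = Pipeline([
    ('preprocess', preprocess),
    ('model', LogisticRegression(max_iter=1000)),
])

# A single fit call learns the encoder, the scaler, and the model together
pipeline.fit(X, y)

# A category not seen during fit is encoded as all zeros instead of raising
new_rows = pd.DataFrame({'quantity': ['unknown'], 'gps_height': [600]})
prediction = pipeline.predict(new_rows)
```

Because the transformers are fit inside `pipeline.fit`, validation and test rows are only ever transformed, never used to compute encoder categories or scaling statistics.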

### Reading
- [ ] [How (and why) to create a good validation set](https://www.fast.ai/2017/11/13/validation-sets/)
- [ ] [Always start with a stupid model, no exceptions](https://blog.insightdatascience.com/always-start-with-a-stupid-model-no-exceptions-3a22314b9aaa)
- [ ] [Statistical Modeling: The Two Cultures](https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726)
- [ ] [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way (without an excessive amount of formulas or academic pre-requisites).



In [1]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # category_encoders, version >= 2.0
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade category_encoders pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module4')

Collecting category_encoders
  Downloading https://files.pythonhosted.org/packages/6e/a1/f7a22f144f33be78afeb06bfa78478e8284a64263a3c09b1ef54e673841e/category_encoders-2.0.0-py2.py3-none-any.whl (87kB)
Collecting pandas-profiling
  Downloading https://files.pythonhosted.org/packages/2c/2f/aae19e2173c10a9bb7fee5f5cad35dbe53a393960fc91abc477dcc4661e8/pandas-profiling-2.3.0.tar.gz (127kB)
Requirement already up-to-date: plotly in /usr/local/lib/python3.6/dist-packages (4.1.1)
Collecting htmlmin>=0.1.12 (from pandas-profiling)
  Downloading https://files.pythonhosted.org/packages/b3/e7/fcd59e12169de19f0131ff2812077f964c6b960e7c09804d30a7bf2ab461/htmlmin-0.1.12.tar.gz
Collecting phik>=0.9.8 (from pandas-profiling)
  Downloading https://files.pythonhosted.org/packages/45/ad/24a16fa4ba612fb96a3c4bb115a5b9741483f53b66d3d3afd987f20fa227/phik-0.9.8-py3-

In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import pandas as pd

train_features = pd.read_csv('../data/tanzania/train_features.csv')
train_labels = pd.read_csv('../data/tanzania/train_labels.csv')
test_features = pd.read_csv('../data/tanzania/test_features.csv')
sample_submission = pd.read_csv('../data/tanzania/sample_submission.csv')

assert train_features.shape == (59400, 40)
assert train_labels.shape == (59400, 2)
assert test_features.shape == (14358, 40)
assert sample_submission.shape == (14358, 2)

# Feature engineering

In [0]:
train_with_labels = train_features.merge(train_labels)
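
When `merge` is called without `on`, pandas joins on the columns the two frames have in common. A slightly more defensive variant, sketched here on toy frames (the `id` join key is an assumption about the real files), names the key explicitly and asks pandas to verify that the join is one-to-one.

```python
import pandas as pd

# Toy stand-ins for train_features and train_labels
features = pd.DataFrame({'id': [1, 2, 3], 'gps_height': [1390, 0, 686]})
labels = pd.DataFrame({
    'id': [3, 1, 2],
    'status_group': ['functional', 'non functional', 'functional'],
})

# validate='one_to_one' raises MergeError if either side has duplicate keys
merged = features.merge(labels, on='id', how='inner', validate='one_to_one')

# No rows should be lost or duplicated by the join
assert len(merged) == len(features)
```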

In [0]:
# import pandas_profiling
# train_features.profile_report()

In [0]:
extraction = ['extraction_type',
              'extraction_type_group',
              'extraction_type_class']

management = ['management', 'management_group']

payment = ['payment', 'payment_type']

# note: this list is named `quality`, but it holds the quantity columns
quality = ['quantity',
           'quantity_group']

region = ['region',
          'region_code']

source = ['source', 'source_type', 'source_class']

waterpoint = ['waterpoint_type', 'waterpoint_type_group']

In [7]:
for cat in extraction:
    print(pd.crosstab(train_with_labels[cat], train_with_labels['status_group'], normalize='index').sort_values('functional', ascending=False))


# let's go with the first one:
# extraction_type

status_group               functional  functional needs repair  non functional
extraction_type                                                               
afridev                      0.677966                 0.023729        0.298305
nira/tanira                  0.664827                 0.078612        0.256561
other - rope pump            0.649667                 0.037694        0.312639
india mark ii                0.603333                 0.032917        0.363750
gravity                      0.599253                 0.100859        0.299888
swn 80                       0.569482                 0.057766        0.372752
submersible                  0.551217                 0.047649        0.401134
other - swn 81               0.524017                 0.030568        0.445415
cemo                         0.500000                 0.100000        0.400000
ksb                          0.496820                 0.018375        0.484806
walimi                       0.479167               
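
For reference, here is what `normalize='index'` does in the crosstabs above, shown on a tiny invented frame: each row is divided by its own total, so every row reads as the share of each status within that category.

```python
import pandas as pd

# Tiny invented example: 4 'enough' pumps and 2 'dry' pumps
df = pd.DataFrame({
    'quantity': ['enough', 'enough', 'enough', 'enough', 'dry', 'dry'],
    'status_group': ['functional', 'functional', 'functional',
                     'non functional', 'non functional', 'non functional'],
})

# Row-normalized counts: each row sums to 1
rates = pd.crosstab(df['quantity'], df['status_group'], normalize='index')

# Put the categories with the highest functional share first
rates = rates.sort_values('functional', ascending=False)
print(rates)
```

Here 3 of the 4 `enough` pumps are functional, so that row shows 0.75 and 0.25, while the `dry` row shows 0.0 and 1.0.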

In [8]:
for cat in management:
    print(pd.crosstab(train_with_labels[cat], train_with_labels['status_group'], normalize='index').sort_values('functional', ascending=False))

# the first one again:
# management

status_group      functional  functional needs repair  non functional
management                                                           
private operator    0.748858                 0.022324        0.228818
water board         0.739857                 0.090351        0.169792
wua                 0.690730                 0.080868        0.228402
wug                 0.599540                 0.099002        0.301458
other               0.598341                 0.065166        0.336493
trust               0.589744                 0.076923        0.333333
parastatal          0.576923                 0.119344        0.303733
vwc                 0.504234                 0.068902        0.426864
water authority     0.493363                 0.057522        0.449115
unknown             0.399287                 0.048128        0.552585
company             0.389781                 0.021898        0.588321
other - school      0.232323                 0.010101        0.757576
status_group      fu

In [9]:
for cat in payment:
    print(pd.crosstab(train_with_labels[cat], train_with_labels['status_group'], normalize='index').sort_values('functional', ascending=False))

# i'll just have to try with each one and see which one is better

status_group           functional  functional needs repair  non functional
payment                                                                   
pay annually             0.752334                 0.067820        0.179846
pay per bucket           0.677796                 0.045520        0.276683
pay monthly              0.660482                 0.111687        0.227831
pay when scheme fails    0.620593                 0.070772        0.308636
other                    0.579696                 0.111954        0.308349
never pay                0.448911                 0.075233        0.475856
unknown                  0.432512                 0.052961        0.514527
status_group  functional  functional needs repair  non functional
payment_type                                                     
annually        0.752334                 0.067820        0.179846
per bucket      0.677796                 0.045520        0.276683
monthly         0.660482                 0.111687        0.22

In [10]:
all(train_with_labels['payment'] == train_with_labels['payment_type'])

False
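
The element-wise comparison is `False` because the labels are spelled differently, even though the rates printed above look identical for both columns. One way to check whether two columns are just relabelings of each other is sketched below; `is_relabeling` is a hypothetical helper, the data is invented, and the sketch assumes neither column has missing values.

```python
import pandas as pd

# Invented example where the labels differ but the grouping is the same
df = pd.DataFrame({
    'payment': ['pay annually', 'never pay', 'pay annually', 'unknown'],
    'payment_type': ['annually', 'never pay', 'annually', 'unknown'],
})


def is_relabeling(frame, col_a, col_b):
    """True if each value of col_a pairs with exactly one value of col_b, and vice versa."""
    a_to_b = frame.groupby(col_a)[col_b].nunique()
    b_to_a = frame.groupby(col_b)[col_a].nunique()
    return bool((a_to_b == 1).all() and (b_to_a == 1).all())


# Same strings everywhere? No: 'pay annually' != 'annually'
same_values = bool((df['payment'] == df['payment_type']).all())

# Same grouping of rows? Yes: the mapping is one-to-one
same_grouping = is_relabeling(df, 'payment', 'payment_type')
```

If the check passes on the real data, either column carries the same information after one-hot encoding, so only one needs to be kept.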

In [11]:
for cat in quality:
    print(pd.crosstab(train_with_labels[cat], train_with_labels['status_group'], normalize='index').sort_values('functional', ascending=False))

# the two columns are identical, so just use quantity

status_group  functional  functional needs repair  non functional
quantity                                                         
enough          0.652323                 0.072320        0.275357
seasonal        0.574074                 0.102716        0.323210
insufficient    0.523234                 0.095842        0.380924
unknown         0.269962                 0.017744        0.712294
dry             0.025136                 0.005924        0.968940
status_group    functional  functional needs repair  non functional
quantity_group                                                     
enough            0.652323                 0.072320        0.275357
seasonal          0.574074                 0.102716        0.323210
insufficient      0.523234                 0.095842        0.380924
unknown           0.269962                 0.017744        0.712294
dry               0.025136                 0.005924        0.968940


In [12]:
all(train_with_labels['quantity'] == train_with_labels['quantity_group'])

True

In [13]:
for cat in region:
    print(pd.crosstab(train_with_labels[cat], train_with_labels['status_group'], normalize='index').sort_values('functional', ascending=False))

# region code, but maybe map so it's linear?

status_group   functional  functional needs repair  non functional
region                                                            
Iringa           0.782206                 0.023234        0.194560
Arusha           0.684776                 0.052239        0.262985
Manyara          0.623500                 0.060644        0.315856
Kilimanjaro      0.602877                 0.073533        0.323590
Pwani            0.590512                 0.013662        0.395825
Dar es Salaam    0.572671                 0.003727        0.423602
Tanga            0.563801                 0.028661        0.407538
Ruvuma           0.560606                 0.062121        0.377273
Shinyanga        0.559815                 0.127459        0.312726
Morogoro         0.528957                 0.074888        0.396156
Kagera           0.520808                 0.091677        0.387515
Mbeya            0.499892                 0.108644        0.391464
Mwanza           0.484204                 0.058994        0.45
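
One possible reading of "map so it's linear" is to replace each region code with the share of functional pumps seen for that code in the training data, so the numeric value is ordered by the target. The sketch below is an assumption about what was meant, on invented data; codes that never appear in training fall back to the overall rate.

```python
import pandas as pd

# Invented training rows and a validation frame with one unseen code (99)
train = pd.DataFrame({
    'region_code': [11, 11, 11, 2, 2, 19, 19, 19, 19],
    'status_group': ['functional', 'functional', 'non functional',
                     'functional', 'non functional',
                     'non functional', 'non functional', 'non functional',
                     'functional'],
})
val = pd.DataFrame({'region_code': [11, 19, 99]})

# Functional share per region code, computed from the training rows only
is_functional = (train['status_group'] == 'functional').astype(int)
rate_by_region = is_functional.groupby(train['region_code']).mean()
overall_rate = is_functional.mean()

# Apply the training-set mapping to both frames
train['region_rate'] = train['region_code'].map(rate_by_region)
val['region_rate'] = val['region_code'].map(rate_by_region).fillna(overall_rate)
```

Computing the rates from the training rows only matters here: using validation labels would leak the target into the feature.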

In [14]:
for cat in source:
    print(pd.crosstab(train_with_labels[cat], train_with_labels['status_group'], normalize='index').sort_values('functional', ascending=False))

# go with source

status_group          functional  functional needs repair  non functional
source                                                                   
spring                  0.622290                 0.074966        0.302744
rainwater harvesting    0.603922                 0.136819        0.259259
other                   0.594340                 0.004717        0.400943
hand dtw                0.568650                 0.019451        0.411899
river                   0.568560                 0.127029        0.304411
shallow well            0.494769                 0.056883        0.448348
machine dbh             0.489571                 0.044334        0.466095
unknown                 0.484848                 0.060606        0.454545
dam                     0.385671                 0.036585        0.577744
lake                    0.211765                 0.015686        0.772549
status_group          functional  functional needs repair  non functional
source_type                           

In [15]:
for cat in waterpoint:
    print(pd.crosstab(train_with_labels[cat], train_with_labels['status_group'], normalize='index').sort_values('functional', ascending=False))

# waterpoint_type

status_group                 functional  ...  non functional
waterpoint_type                          ...                
dam                            0.857143  ...        0.142857
cattle trough                  0.724138  ...        0.258621
improved spring                0.718112  ...        0.173469
communal standpipe             0.621485  ...        0.299278
hand pump                      0.617852  ...        0.323307
communal standpipe multiple    0.366213  ...        0.527609
other                          0.131661  ...        0.822414

[7 rows x 3 columns]
status_group           functional  functional needs repair  non functional
waterpoint_type_group                                                     
dam                      0.857143                 0.000000        0.142857
cattle trough            0.724138                 0.017241        0.258621
improved spring          0.718112                 0.108418        0.173469
hand pump                0.617852                 0.05

In [24]:
# columns to drop
drop = [

        # kept

        # # low cardinality
        # 'basin', #                keep because low cardinality
        # 'quality_group', #        keep because low cardinality
        # 'quantity', #             keep because low cardinality
        # 'scheme_management', #    keep because low cardinality
        # 'water_quality', #        keep because low cardinality
        
        # # think it was better than alternatives
        # 'extraction_type', #      don't drop because i think this is the most useful of the extraction_type ones
        # 'management', #           keep because i thought it was more useful than management_group
        # 'payment', #              keep because i don't think there's much of a difference between payment and payment_type
        # 'region_code', #          keep because i preferred it over region
        # 'source', #               keep because preferred to source_class, source_type
        # 'waterpoint_type', #      keep because preferred over waterpoint_type_group

        # # kept because number
        # 'district_code', #        keep because numeric
        # 'gps_height', #           keep because number and i think there's a relationship

        # dropped

        # dates
        'construction_year', #      drop because has missing and is date
        'date_recorded', #          drop because date

        # had alternatives
        'extraction_type_class', #  didn't think was as useful as extraction_type
        'extraction_type_group', #  same with class 
        'management_group', #       drop because i preferred management
        'payment_type', #           drop because payment is kinda the same thing
        'region', #                 drop because i preferred region code
        'waterpoint_type_group', #  drop because prefer waterpoint_type
        'source_class', #           drop because source is better
        'source_type', #            drop because source is better

        # useless
        'id', #                     drop because doesn't mean anything
        'latitude', #               doesn't mean anything
        'longitude', #              same as latitude
        'recorded_by', #            drop because it's only 1 value
        'quantity_group', #         drop because just a repeat of quantity
        'permit', #                 drop because missing values, 5%, and i dont think it would add anything, didn't notice a relationship
        'public_meeting', #         drop because missing values, 5%, and too many trues
        'population', #             drop because too many 0 values, which i assume are missing?
        'amount_tsh', #             drop because too many 0, so data is too skewed
        'num_private', #            drop because too many 0

        # too many missing
        'scheme_name', #            drop because too many missing 47%

        # high cardinality
        'wpt_name', #               drop because high cardinality
        'subvillage', #             drop because too high cardinality
        'ward', #                   drop because too high cardinality
        'funder', #                 drop because high cardinality and missing values, 6%
        # 'installer', #              same as funder, too many values and missing values, 6%
        'lga', #                    drop because too many values, maybe could encode somehow
        ]

features = train_features.drop(drop, axis=1)
features.dtypes.sort_values()

gps_height            int64
region_code           int64
district_code         int64
basin                object
scheme_management    object
extraction_type      object
management           object
payment              object
water_quality        object
quality_group        object
quantity             object
source               object
waterpoint_type      object
dtype: object

In [25]:
columns_to_use = features.columns.to_list()
columns_to_use

['gps_height',
 'basin',
 'region_code',
 'district_code',
 'scheme_management',
 'extraction_type',
 'management',
 'payment',
 'water_quality',
 'quality_group',
 'quantity',
 'source',
 'waterpoint_type']

# Do train/validate/test split with the Tanzania Waterpumps data.

In [0]:
from sklearn.model_selection import train_test_split

train_features = pd.read_csv('../data/tanzania/train_features.csv')
train_labels = pd.read_csv('../data/tanzania/train_labels.csv')
test_features = pd.read_csv('../data/tanzania/test_features.csv')
sample_submission = pd.read_csv('../data/tanzania/sample_submission.csv')

X_train = train_features
y_train = train_labels['status_group']

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, train_size = 0.80, stratify = y_train, random_state = 69
)
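
To see what `stratify` buys you, the sketch below splits an invented, imbalanced three-class target and compares the class shares on each side. The class counts and `random_state=42` are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented imbalanced target: 54 / 38 / 8 rows per class
y = pd.Series(['functional'] * 54
              + ['non functional'] * 38
              + ['functional needs repair'] * 8)
X = pd.DataFrame({'row': range(len(y))})

X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.80, stratify=y, random_state=42
)

# With stratify=y the class shares stay close on both sides of the split
train_share = y_train.value_counts(normalize=True)
val_share = y_val.value_counts(normalize=True)
print(train_share)
print(val_share)
```

Without stratification, a rare class such as `functional needs repair` can end up noticeably over- or under-represented in a small validation set.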

# Do one-hot encoding. (Remember it may not work with high cardinality categoricals.)

In [123]:
cats = X_train[columns_to_use].select_dtypes('object').columns.to_list()
cats

['basin',
 'scheme_management',
 'extraction_type',
 'management',
 'payment',
 'water_quality',
 'quality_group',
 'quantity',
 'source',
 'waterpoint_type']

In [0]:
cats = X_train[cats].columns[X_train[cats].nunique() < 25].tolist()

In [125]:
X_train[cats].nunique()

basin                 9
scheme_management    12
extraction_type      18
management           12
payment               7
water_quality         8
quality_group         6
quantity              5
source               10
waterpoint_type       7
dtype: int64

In [126]:
import category_encoders as ce
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder

# work on copies so the column assignments below don't raise SettingWithCopyWarning
X_train = X_train.copy()
X_val = X_val.copy()

# installer: fill missing training values with the most common installer
train_installer = X_train['installer'].fillna(X_train['installer'].mode()[0])

# fit on the training installers plus an explicit 'other' class, so that
# train and validation share one consistent set of integer codes
label_encoder = LabelEncoder()
label_encoder.fit(np.append(train_installer.unique(), 'other'))
known_installers = set(label_encoder.classes_)

X_train['installer'] = label_encoder.transform(train_installer)

# map validation installers not seen in training (including missing ones) to 'other'
X_val['installer'] = X_val['installer'].map(lambda s: s if s in known_installers else 'other')

# transform
X_val['installer'] = label_encoder.transform(X_val['installer'])

numeric_features = X_train[columns_to_use].select_dtypes(['number', 'bool']).columns.tolist()
features = cats + numeric_features + ['installer']

X_train_subset = X_train[features]
X_val_subset = X_val[features]

# fit the encoder on the training set only, then apply it to validation
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train_encoded = encoder.fit_transform(X_train_subset)
X_val_encoded = encoder.transform(X_val_subset)

# same for the scaler: statistics come from the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_encoded)
X_val_scaled = scaler.transform(X_val_encoded)


In [128]:
features

['basin',
 'scheme_management',
 'extraction_type',
 'management',
 'payment',
 'water_quality',
 'quality_group',
 'quantity',
 'source',
 'waterpoint_type',
 'gps_height',
 'region_code',
 'district_code',
 'installer']

# Use scikit-learn for logistic regression.

In [132]:
from sklearn.ensemble import RandomForestClassifier
import time

# note: the heading says logistic regression, but this cell fits a random forest
start = time.time()

model = RandomForestClassifier(n_jobs=-1, n_estimators=1000)

model.fit(X_train_scaled, y_train)

# validation accuracy
print('score ', model.score(X_val_scaled, y_val))

print('time ', (time.time() - start) / 60, ' mins')

score  0.7697811447811448
time  0.631817889213562  mins
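
The cell above fits a random forest, so for the logistic regression and coefficient items on the checklist, here is a minimal sketch on synthetic data. The feature names are borrowed from this notebook, but the data, the true effect sizes, and the binary target are all invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features standing in for the encoded training data
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'gps_height': rng.normal(size=300),
    'quantity_dry': rng.integers(0, 2, size=300),
})

# Synthetic binary target: higher gps_height helps, a dry source hurts
logits = 1.5 * X['gps_height'] - 2.0 * X['quantity_dry']
y = (logits + rng.logistic(size=300) > 0).astype(int)

# Scaling puts the coefficients on a comparable footing
X_scaled = StandardScaler().fit_transform(X)

model = LogisticRegression(max_iter=1000)
model.fit(X_scaled, y)

# One coefficient per feature, labeled and sorted for reading or plotting
coefficients = pd.Series(model.coef_[0], index=X.columns).sort_values()
print(coefficients)
```

Calling `coefficients.plot.barh()` draws them as a horizontal bar chart. With the real three-class target, `model.coef_` has one row per class, so you would build one such series per row.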


### my first was: ***0.725***
### second: 0.732