## Homework 3: Machine Learning for Classification

#### Dataset

In this homework, we will use the lead scoring dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv).

In this dataset, the target for the classification task is the converted variable: whether the client has signed up to the platform or not.


In [1]:
import pandas as pd
import requests

from pathlib import Path


FILE_NAME = 'course_lead_scoring.csv'


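def fetch():
    """Download the dataset CSV into the working directory."""
    resp = requests.get(
        f'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/{FILE_NAME}',
        allow_redirects=False,
        timeout=10,
    )

    resp.raise_for_status()

    # Write the raw bytes so the file does not depend on a guessed text encoding
    with open(FILE_NAME, 'wb') as f:
        f.write(resp.content)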

if not Path(FILE_NAME).exists():
    fetch()

df = pd.read_csv(FILE_NAME)
df.head()

|   | lead_source | industry | number_of_courses_viewed | annual_income | employment_status | location | interaction_count | lead_score | converted |
|---|---|---|---|---|---|---|---|---|---|
| 0 | paid_ads | NaN | 1 | 79450.0 | unemployed | south_america | 4 | 0.94 | 1 |
| 1 | social_media | retail | 1 | 46992.0 | employed | south_america | 1 | 0.8 | 0 |
| 2 | events | healthcare | 5 | 78796.0 | unemployed | australia | 3 | 0.69 | 1 |
| 3 | paid_ads | retail | 2 | 83843.0 | NaN | australia | 1 | 0.87 | 0 |
| 4 | referral | education | 3 | 85012.0 | self_employed | europe | 3 | 0.62 | 1 |


#### Data preparation

Check whether the features contain missing values.

If there are missing values:
 * For categorical features, replace them with 'NA'
 * For numerical features, replace them with 0.0

In [2]:
categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)
df[categorical_columns] = df[categorical_columns].fillna('NA')
df.fillna(0, inplace=True)

In [3]:
df.isnull().any()

lead_source                 False
industry                    False
number_of_courses_viewed    False
annual_income               False
employment_status           False
location                    False
interaction_count           False
lead_score                  False
converted                   False
dtype: bool
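
The cells above fill the gaps and then confirm that none remain, but the check before filling is not shown. A minimal sketch of check-then-fill on a toy frame follows; the `toy` data and the use of `select_dtypes` to separate the column types are assumptions for illustration, not part of the homework dataset.

```python
import pandas as pd

# Hypothetical two-column frame with one gap in each column.
toy = pd.DataFrame({
    'industry': ['retail', None, 'finance'],
    'annual_income': [50000.0, None, 72000.0],
})

# Count missing values per column before changing anything.
missing_before = toy.isnull().sum()

# Treat non-numeric columns as categorical and the rest as numerical.
cat_cols = list(toy.select_dtypes(exclude='number').columns)
num_cols = list(toy.select_dtypes(include='number').columns)

toy[cat_cols] = toy[cat_cols].fillna('NA')
toy[num_cols] = toy[num_cols].fillna(0.0)

# Every count should now be zero.
missing_after = toy.isnull().sum()
```

Filling the two groups separately makes the intent explicit and avoids relying on the order of the `fillna` calls.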

## Question 1
What is the most frequent observation (mode) for the column industry?

In [4]:
df['industry'].value_counts()

industry
retail           203
finance          200
other            198
healthcare       187
education        187
technology       179
manufacturing    174
NA               134
Name: count, dtype: int64
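
The mode can also be read off directly instead of scanning the counts by eye. A small sketch on a hypothetical toy series (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical column; 'retail' appears most often.
industry = pd.Series(['retail', 'finance', 'retail', 'NA', 'finance', 'retail'])

# Series.mode() returns all most-frequent values; take the first one.
most_frequent = industry.mode().iloc[0]

# The same answer via the counts: the label with the highest count.
same_via_counts = industry.value_counts().idxmax()
```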

## Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset. In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

In [5]:
pairs = [
    ('interaction_count', 'lead_score'),
    ('number_of_courses_viewed', 'lead_score'),
    ('number_of_courses_viewed', 'interaction_count'),
    ('annual_income', 'interaction_count'),
]

corr = df.corr(numeric_only=True)
corr

|   | number_of_courses_viewed | annual_income | interaction_count | lead_score | converted |
|---|---|---|---|---|---|
| number_of_courses_viewed | 1.0 | 0.00977 | -0.023565 | -0.004879 | 0.435914 |
| annual_income | 0.00977 | 1.0 | 0.027036 | 0.01561 | 0.053131 |
| interaction_count | -0.023565 | 0.027036 | 1.0 | 0.009888 | 0.374573 |
| lead_score | -0.004879 | 0.01561 | 0.009888 | 1.0 | 0.193673 |
| converted | 0.435914 | 0.053131 | 0.374573 | 0.193673 | 1.0 |


In [6]:
rank = {}
for pair in pairs:
    # Compare by absolute value so strong negative correlations also count
    rank[pair] = round(abs(corr.loc[pair[0], pair[1]]), 4)

In [7]:
rank

{('interaction_count', 'lead_score'): np.float64(0.0099),
 ('number_of_courses_viewed', 'lead_score'): np.float64(0.0049),
 ('number_of_courses_viewed', 'interaction_count'): np.float64(0.0236),
 ('annual_income', 'interaction_count'): np.float64(0.027)}

In [8]:
max(rank, key=rank.get)

('annual_income', 'interaction_count')
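
The cells above rank a hand-written list of candidate pairs. As an alternative sketch, every pair of columns can be ranked without listing them manually. The `toy` frame and its column names are hypothetical; on the real data one would probably drop the target column first.

```python
from itertools import combinations

import pandas as pd

# Hypothetical numeric features; 'b' is close to 2 * 'a'.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 11],
    'c': [5, 3, 4, 1, 2],
})

corr = toy.corr()

# One entry per unordered pair of distinct columns, scored by |correlation|.
scores = {(a, b): abs(corr.loc[a, b]) for a, b in combinations(corr.columns, 2)}

best_pair = max(scores, key=scores.get)
```

Using `combinations` skips the diagonal and the mirrored half of the matrix, so each pair is scored once.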

#### Split the data

* Split your data in train/val/test sets with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.
* Make sure that the target value y is not in your dataframe.


In [9]:
from sklearn.model_selection import train_test_split

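# Hold out 20% of the rows; this notebook uses that holdout as the validation set
df_train_full, df_val = train_test_split(df, test_size=0.2, random_state=42)
# 25% of the remaining 80% is 20% of the full data, which leaves 60% for training
df_train, df_test = train_test_split(df_train_full, test_size=0.25, random_state=42)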

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train.converted.values
y_val = df_val.converted.values
y_test = df_test.converted.values

del df_train['converted']
del df_val['converted']
del df_test['converted']
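
The two-step split above relies on a bit of arithmetic: the second call takes 25% of the remaining 80%, which is 20% of the whole. A quick sketch that checks the resulting sizes on 1000 hypothetical row ids (the variable names are illustrative only):

```python
from sklearn.model_selection import train_test_split

# 1000 hypothetical row ids standing in for the dataframe rows.
rows = list(range(1000))

# First split: 80% kept, 20% held out.
rest, holdout_a = train_test_split(rows, test_size=0.2, random_state=42)

# Second split: 25% of the remaining 80% is another 20% of the full data.
train_part, holdout_b = train_test_split(rest, test_size=0.25, random_state=42)

sizes = (len(train_part), len(holdout_a), len(holdout_b))
```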

## Question 3

* Calculate the mutual information score between y and other categorical variables in the dataset. Use the training set only.
* Round the scores to 2 decimals using round(score, 2).


In [11]:
from sklearn.metrics import mutual_info_score

rank = {}
for c in categorical_columns:
    rank[c] = round(mutual_info_score(df_train[c], y_train), 2)


In [12]:
rank

{'lead_source': 0.04,
 'industry': 0.01,
 'employment_status': 0.01,
 'location': 0.0}

In [13]:
max(rank, key=rank.get)

'lead_source'
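
For intuition about what the score measures, mutual information can be computed by hand from label counts. The sketch below is a plain-Python illustration, not the Scikit-Learn implementation; the helper name `mutual_information` and the toy labels are assumptions. It uses the natural logarithm, as `mutual_info_score` does.

```python
from collections import Counter
from math import isclose, log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length label sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)

    mi = 0.0
    for (x, y), n_xy in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (n_xy / n) * log(n_xy * n / (px[x] * py[y]))
    return mi


# The category fully determines the target: MI equals log(2).
dependent = mutual_information(['a', 'a', 'b', 'b'], [1, 1, 0, 0])

# The category says nothing about the target: MI equals 0.
independent = mutual_information(['a', 'a', 'b', 'b'], [1, 0, 1, 0])
```

A score near 0, such as the one for location above, means the feature carries almost no information about the target.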

## Question 4

Now let's train a logistic regression.

Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.

Fit the model on the training dataset.

To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:

```
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
```

Calculate the accuracy on the validation dataset and round it to 2 decimal digits.


In [14]:
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer(sparse=False)

train_dict = df_train.to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)


In [15]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)

model.fit(X_train, y_train)

y_pred = model.predict_proba(X_val)[:, 1]
converted_decision = (y_pred >= 0.5)

round((y_val == converted_decision).mean(), 2)

np.float64(0.73)

## Question 5

Let's find the least useful feature using the feature elimination technique.

Train a model using the same features and parameters as in Q4 (without rounding).

Now exclude each feature from this set and train a model without it. Record the accuracy for each model.

For each feature, calculate the difference between the original accuracy and the accuracy without the feature.


In [16]:
def get_x(df_train, df_val, features):
    # Fit the vectorizer on the training rows only and reuse it for validation,
    # so both matrices share the same columns in the same order.
    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(df_train[features].to_dict(orient='records'))
    X_val = dv.transform(df_val[features].to_dict(orient='records'))
    return X_train, X_val

def train(X_train, y_train, X_val, y_val):
    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)

    y_pred = model.predict_proba(X_val)[:, 1]
    converted_decision = (y_pred >= 0.5)

    # Validation accuracy, kept to 4 decimals
    return round((y_val == converted_decision).mean(), 4)

# base model

features = list(df_train.columns)

X_train_base, X_val_base = get_x(df_train, df_val, features)

base_accuracy = train(X_train_base, y_train, X_val_base, y_val)

In [17]:

elimitante = {k: 0 for k in ['industry', 'employment_status', 'lead_score']}

for k in elimitante:
    # Retrain without feature k and record how far the accuracy moves
    features = list(df_train.columns)
    features.remove(k)
    X_train_k, X_val_k = get_x(df_train, df_val, features)

    elimitante[k] = round(abs(base_accuracy - train(X_train_k, y_train, X_val_k, y_val)), 4)

In [19]:
elimitante

{'industry': np.float64(0.0102),
 'employment_status': np.float64(0.0136),
 'lead_score': np.float64(0.0068)}

In [25]:
min(elimitante, key=elimitante.get)

'lead_score'

## Question 6


Now let's train a regularized logistic regression.

Let's try the following values of the parameter C: [0.01, 0.1, 1, 10, 100].

Train models using all the features as in Q4.

Calculate the accuracy on the validation dataset and round it to 3 decimal digits.


In [21]:
dv = DictVectorizer(sparse=False)

train_dict = df_train.to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)

rank = {}

for c in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(solver='liblinear', C=c, max_iter=1000, random_state=42)

    model.fit(X_train, y_train)

    y_pred = model.predict_proba(X_val)[:, 1]

    converted_decision = (y_pred >= 0.5)

    rank[c] = round((y_val == converted_decision).mean(), 10)

In [22]:
rank

{0.01: np.float64(0.7303754266),
 0.1: np.float64(0.7303754266),
 1: np.float64(0.7269624573),
 10: np.float64(0.7269624573),
 100: np.float64(0.7269624573)}

In [23]:
max(rank, key=rank.get)

0.01
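
Two values of C tie for the best accuracy here. `max` returns the first maximal key it meets, which is 0.01 only because the candidates were tried in ascending order. A sketch of an explicit tie-break toward the smallest C, using the accuracies from the output above rounded to 3 decimals (the name `accuracy_by_c` is illustrative):

```python
# Validation accuracies copied from the output above, rounded to 3 decimals.
accuracy_by_c = {0.01: 0.730, 0.1: 0.730, 1: 0.727, 10: 0.727, 100: 0.727}

best_accuracy = max(accuracy_by_c.values())

# Among the values of C that reach the best accuracy, prefer the smallest one.
best_c = min(c for c, acc in accuracy_by_c.items() if acc == best_accuracy)
```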