## Mobile Plan Classifier

A mobile carrier wants to analyse usage metrics for calls, messages and internet data in order to recommend one of its new plans, Smart or Ultra, to users who are still on legacy plans. We have a cleaned and processed dataset to work with.

### Prepare the Data

In [2]:
#Import data handling libraries
import pandas as pd
import numpy as np

In [3]:
try:
    df = pd.read_csv('/datasets/users_behavior.csv')
except FileNotFoundError:
    #Report the problem and stop; every later cell needs df to exist
    print('Could not load dataset, check filepath')
    raise

In [4]:
df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


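Let's split the data into three sets in a 3:1:1 ratio: Train (60%), Validate (20%), Test (20%). We do this in two steps: first we hold out 60% for training, then we split the remaining 40% in half.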

In [6]:
#Import relevant library
from sklearn.model_selection import train_test_split

In [7]:
df_train, df_test_validate = train_test_split(df,train_size = 0.6, random_state = 12345)

In [8]:
#Making sure the split is as expected.
df_train.shape[0]

1928

In [9]:
df_validate, df_test = train_test_split(df_test_validate,train_size = 0.5, random_state = 12345)

In [10]:
#Making sure the validate set makes up 20% of total dataset
df_validate.shape[0] * 100 / df.shape[0]

20.00622277535781
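
The two-step split above can also be wrapped in a small helper. The sketch below is only an illustration: the function name `split_train_validate_test` and the synthetic `demo` frame are assumptions made here so the block runs without the real dataset; only the row count (3214) and the `train_test_split` arguments are taken from the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_train_validate_test(data, seed=12345):
    """Split a DataFrame into 60% train, 20% validation and 20% test."""
    # Step 1: keep 60% of the rows for training.
    train, rest = train_test_split(data, train_size=0.6, random_state=seed)
    # Step 2: halve the remaining 40% into validation and test.
    validate, test = train_test_split(rest, train_size=0.5, random_state=seed)
    return train, validate, test


# Synthetic stand-in with the same number of rows as the real dataset.
demo = pd.DataFrame({"x": np.arange(3214), "is_ultra": np.arange(3214) % 2})
demo_train, demo_validate, demo_test = split_train_validate_test(demo)

for name, part in [("train", demo_train), ("validate", demo_validate), ("test", demo_test)]:
    print(f"{name}: {len(part)} rows ({len(part) * 100 / len(demo):.2f}%)")
```

Because the classes are imbalanced, passing `stratify=data['is_ultra']` to the first call (and `stratify=rest['is_ultra']` to the second) would keep the class ratio the same in every subset; that is an optional variation, not what the cells above do.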

### Test Models

This is a classification problem. We need to predict if users are on ultra plan or not based on usage metrics. We will utilise the following models and pick the one with best accuracy:
1. Logistic Regression 
2. Decision Tree Classifier
3. Random Forest Classifier

Our target accuracy is **at least 75%**

Let's use the `accuracy_score` function from `sklearn.metrics` to measure the accuracy of our models.

In [11]:
from sklearn.metrics import accuracy_score

In [12]:
df_train_features = df_train.drop(['is_ultra'], axis = 1)
df_train_features.head()

Unnamed: 0,calls,minutes,messages,mb_used
3027,60.0,431.56,26.0,14751.26
434,33.0,265.17,59.0,17398.02
1226,52.0,341.83,68.0,15462.38
1054,42.0,226.18,21.0,13243.48
1842,30.0,198.42,0.0,8189.53


In [13]:
df_train_target = df_train['is_ultra']
df_train_target.head()

3027    0
434     0
1226    0
1054    0
1842    0
Name: is_ultra, dtype: int64

In [14]:
df_validate_features = df_validate.drop(['is_ultra'],axis = 1)
df_validate_features.head()

Unnamed: 0,calls,minutes,messages,mb_used
1386,92.0,536.96,18.0,20193.9
3124,40.0,286.57,17.0,17918.75
1956,81.0,531.22,56.0,17755.06
2286,67.0,460.76,27.0,16626.26
3077,22.0,120.09,16.0,9039.57


In [16]:
df_validate_target = df_validate['is_ultra']
df_validate_target.head()

1386    0
3124    0
1956    0
2286    0
3077    0
Name: is_ultra, dtype: int64

In [17]:
df_test_features = df_test.drop(['is_ultra'],axis = 1)
df_test_features.head()

Unnamed: 0,calls,minutes,messages,mb_used
160,61.0,495.11,8.0,10891.23
2498,80.0,555.04,28.0,28083.58
1748,87.0,697.23,0.0,8335.7
1816,41.0,275.8,9.0,10032.39
1077,60.0,428.49,20.0,29389.52


In [18]:
df_test_target = df_test['is_ultra']
df_test_target.head()

160     0
2498    0
1748    0
1816    0
1077    1
Name: is_ultra, dtype: int64

In [19]:
def accuracy (model_name,model, df_validate_features = df_validate_features, df_validate_target = df_validate_target, 
              df_test_features = df_test_features, df_test_target = df_test_target):
    
    #Build validation prediction array
    validate_predictions = model.predict(df_validate_features)
    #Report Accuracy
    validate_accuracy = accuracy_score(df_validate_target, validate_predictions)
    print(f'Accuracy of {model_name} on Validation Set: {validate_accuracy * 100:.2f}%')
    #Build test prediction array
    test_predictions = model.predict(df_test_features)
    #Report accuracy
    test_accuracy = accuracy_score(df_test_target, test_predictions)
    print(f'Accuracy of {model_name} on Test Set: {test_accuracy * 100:.2f}%')

#### Logistic Regression

In [20]:
from sklearn.linear_model import LogisticRegression

In [21]:
#Set hyperparameters
model_logistic = LogisticRegression(random_state = 12345, solver = 'liblinear', max_iter = 200)

In [22]:
#Train the model
model_logistic.fit(df_train_features, df_train_target)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=200,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=12345, solver='liblinear', tol=0.0001,
                   verbose=0, warm_start=False)

In [23]:
accuracy('LogisticRegression', model_logistic)

Accuracy of LogisticRegression on Validation Set: 75.89%
Accuracy of LogisticRegression on Test Set: 74.03%


The validation accuracy (75.89%) just clears our threshold of 75%, but the test accuracy (74.03%) falls short of it. We could tweak some parameters and try again, but first let's look at other models.
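
One tweak that could be tried later is standardising the features before fitting, since `mb_used` is in the tens of thousands while `calls` and `messages` are in the tens. The sketch below is an assumption about what such a tweak might look like, not something run in this notebook: the data is synthetic (only the column names match the dataset), the target rule is made up, and no claim is made about how it would score on the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data that only mimics the column names and rough scales.
rng = np.random.default_rng(12345)
n_rows = 500
demo_features = pd.DataFrame({
    "calls": rng.integers(0, 150, n_rows).astype(float),
    "minutes": rng.uniform(0, 1000, n_rows),
    "messages": rng.integers(0, 100, n_rows).astype(float),
    "mb_used": rng.uniform(0, 40000, n_rows),
})
# Made-up rule for the target, purely so the example has something to learn.
demo_target = (demo_features["mb_used"] > 25000).astype(int)

# The scaler is fitted on the training data only, then the same
# transformation is applied before every call to predict().
scaled_logistic = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345, solver="liblinear", max_iter=200),
)
scaled_logistic.fit(demo_features, demo_target)

demo_accuracy = scaled_logistic.score(demo_features, demo_target)
print(f"Accuracy on the synthetic training data: {demo_accuracy * 100:.2f}%")
```

Since the pipeline exposes `predict`, it could be passed to the `accuracy` helper defined earlier in place of `model_logistic`.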

#### Decision Tree Classifier

In [24]:
from sklearn.tree import DecisionTreeClassifier

In [25]:
#Set hyperparameters. We keep the depth as 5 to start with and will vary it as needed.
model_tree = DecisionTreeClassifier(random_state = 12345, max_depth = 5)

In [26]:
#Let's train the model
model_tree.fit(df_train_features, df_train_target)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=12345, splitter='best')

In [27]:
accuracy('Decision Tree',model_tree)

Accuracy of Decision Tree on Validation Set: 77.92%
Accuracy of Decision Tree on Test Set: 78.38%


This is an acceptably high accuracy: both the validation and test scores clear our 75% target.

In [28]:
train_predictions = model_tree.predict(df_train_features)

In [29]:
train_accuracy = accuracy_score(df_train_target, train_predictions)
print(f'Accuracy of Decision Tree on Training Set: {train_accuracy * 100:.2f}%')

Accuracy of Decision Tree on Training Set: 82.00%


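The training accuracy (82.00%) is only about 4 percentage points above the validation accuracy (77.92%), so the model is not too overfit either.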
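
To see how `max_depth` affects overfitting, the training and validation accuracies can be compared across a range of depths. This is a minimal sketch on synthetic data from `make_classification` (an assumption, so the block runs on its own); with the real data, `df_train_features`, `df_train_target`, `df_validate_features` and `df_validate_target` would take the place of the generated arrays.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic four-feature binary classification problem.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.75, random_state=12345)

depth_scores = {}
for depth in range(1, 11):
    tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    tree.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, tree.predict(X_train))
    valid_acc = accuracy_score(y_valid, tree.predict(X_valid))
    depth_scores[depth] = (train_acc, valid_acc)
    print(f"max_depth={depth:2d}  train={train_acc:.3f}  validation={valid_acc:.3f}")

# Pick the depth with the best validation accuracy, never the test accuracy.
best_depth = max(depth_scores, key=lambda d: depth_scores[d][1])
print(f"Best max_depth on validation: {best_depth}")
```

A widening gap between the two columns as the depth grows is the usual sign that the tree is starting to memorise the training set.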

#### Random Forest Classifier

In [30]:
from sklearn.ensemble import RandomForestClassifier

In [31]:
#Set hyperparameters. Random Forest is slower but we can accept that in return for higher accuracy.
model_forest = RandomForestClassifier(random_state = 12345, n_estimators = 100)

In [32]:
model_forest.fit(df_train_features, df_train_target)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=12345,
                       verbose=0, warm_start=False)

In [38]:
validate_predictions = model_forest.predict(df_validate_features)

validate_accuracy = accuracy_score(df_validate_target, validate_predictions)
print(f'Accuracy of Random Forest on Validation Set: {validate_accuracy * 100:.2f}%')

Accuracy of Random Forest on Validation Set: 78.54%


The validation accuracy is similar to that of the Decision Tree (78.54% vs 77.92%). Let's tweak the hyperparameters to see if we can get better accuracy.

In [34]:
model_forest_tweaked = RandomForestClassifier(random_state = 12345, n_estimators = 200)

In [35]:
model_forest_tweaked.fit(df_train_features, df_train_target)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=None, oob_score=False, random_state=12345,
                       verbose=0, warm_start=False)

In [39]:
validate_predictions = model_forest_tweaked.predict(df_validate_features)

validate_accuracy = accuracy_score(df_validate_target, validate_predictions)
print(f'Accuracy of Random Forest Tweaked on Validation Set: {validate_accuracy * 100:.2f}%')

Accuracy of Random Forest Tweaked on Validation Set: 78.69%


A very similar accuracy is obtained for `n_estimators` = 200 as well (78.69% vs 78.54%). We gain very little accuracy but add computation time, so we will stick with `n_estimators` = 100.
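
The trade-off between accuracy and computation time can be measured directly by timing the fit for each value of `n_estimators`. The sketch below uses synthetic data and three arbitrary forest sizes (both are assumptions for illustration); the timings depend on the machine, so no expected numbers are given.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic four-feature binary classification problem.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.75, random_state=12345)

forest_results = {}
for n_trees in (50, 100, 200):
    forest = RandomForestClassifier(random_state=12345, n_estimators=n_trees)
    start = time.perf_counter()
    forest.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start
    valid_acc = forest.score(X_valid, y_valid)  # mean accuracy on validation
    forest_results[n_trees] = (valid_acc, fit_seconds)
    print(f"n_estimators={n_trees:3d}  validation={valid_acc:.3f}  fit={fit_seconds:.2f}s")
```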

In [40]:
accuracy('Random Forest',model_forest)

Accuracy of Random Forest on Validation Set: 78.54%
Accuracy of Random Forest on Test Set: 78.54%


#### Sanity Check 

We will use a constant model that always predicts the majority class in the data as a benchmark.

In [37]:
df['is_ultra'].value_counts()

0    2229
1     985
Name: is_ultra, dtype: int64

In [41]:
#Constant predictions: every user is assigned the majority class (0, i.e. not on the Ultra plan).
predictions = [0 for i in range(df.shape[0])]

In [43]:
df_target = df['is_ultra']

In [44]:
accuracy_sanity = accuracy_score(df_target,predictions)
print(f'Accuracy for sanity check is: {accuracy_sanity * 100:.2f}%')

Accuracy for sanity check is: 69.35%
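
The same benchmark can be built with scikit-learn's `DummyClassifier`, which with `strategy='most_frequent'` always predicts the majority class. The sketch below rebuilds the target from the class counts shown above (2229 zeros, 985 ones) so that it runs without the dataset; the placeholder `features` array is an assumption and is ignored by the classifier.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Target rebuilt from the value_counts() output: 2229 users on 0, 985 on 1.
target = np.array([0] * 2229 + [1] * 985)
# Placeholder features; DummyClassifier does not look at them.
features = np.zeros((len(target), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(features, target)

baseline_accuracy = accuracy_score(target, baseline.predict(features))
print(f"Baseline accuracy: {baseline_accuracy * 100:.2f}%")
```

The result is simply the share of the majority class, 2229 / 3214, which is about 69.35%.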


### Conclusions

1. We find that the most accurate and robust model is the Random Forest Classifier, with 78.54% accuracy on both the validation and test sets. It is also the model that takes the longest to train.
2. A good balance is the Decision Tree Classifier, which trains faster and has similar accuracy but is more sensitive to changes in hyperparameters like `max_depth`. Its test accuracy is 78.38%.
3. The sanity check, a constant model that always predicts the majority class, has an accuracy of ~69.4%. Both tree-based models beat it by about 9 percentage points.