# Project Description

- Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.
- You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.
- Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.

# Step 1: Open the data file and study the general information

Load necessary libraries

In [1]:
import pandas as pd
import sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

Open data file and study the general info

In [2]:
plan = pd.read_csv('https://code.s3.yandex.net/datasets/users_behavior.csv')
plan.info()
plan.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


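|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0  | 311.9   | 83.0     | 19915.42 | 0 |
| 1 | 85.0  | 516.75  | 56.0     | 22696.96 | 0 |
| 2 | 77.0  | 467.66  | 86.0     | 21060.45 | 0 |
| 3 | 106.0 | 745.53  | 81.0     | 8437.39  | 1 |
| 4 | 66.0  | 418.74  | 1.0      | 14502.75 | 0 |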


# Conclusions for Step 1
- There are 3214 observations, each describing the behavior of a Megaline customer.
- The features include: calls (number of calls), minutes (total call duration in minutes), messages (number of text messages), mb_used (Internet traffic used in MB).
- The target is the is_ultra variable: plan for the current month (Ultra - 1, Smart - 0).
- There are no missing values (all variables have 3214 non-null observations).
- The data types are correct for all variables.
### Therefore we don't need to preprocess the data

# Step 2: Split the source data into a training set, a validation set, and a test set with the ratio 3:1:1

Create the feature and target variables

In [3]:
feature = plan.drop('is_ultra', axis=1)
target = plan['is_ultra']

Split the source data into a training set, a validation set, and a test set with the ratio 3:1:1.
Use train_test_split twice.

In [4]:
# First split: 60% for training, 40% held out (to be divided into validation and test sets)
feature_train, feature_test, target_train, target_test = train_test_split(feature, target, 
                                                                           test_size=0.4, random_state=12)

In [5]:
# Second split: divide the held-out 40% equally into validation and test sets
feature_val, feature_test, target_val, target_test = train_test_split(feature_test, target_test,
                                                                     test_size=0.5, random_state=12)

Test if each dataset has the correct proportion (60%, 20%, 20% of the total number of observations of the source data)

In [6]:
for data in [target_train, target_val, target_test]:
    print(round(len(data)/len(target), 2))

0.6
0.2
0.2
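
The rounded proportions hide the exact set sizes. As a rough check, the sketch below reproduces the arithmetic, assuming that `train_test_split` rounds a fractional `test_size` up and gives the remaining rows to the training part. The helper name `split_sizes` is hypothetical and used only for illustration.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), assuming the test part is rounded up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test


n_total = 3214  # number of rows reported by plan.info()

# First split: 60% train, 40% held out.
n_train, n_holdout = split_sizes(n_total, 0.4)
# Second split: the held-out part is halved into validation and test.
n_val, n_test = split_sizes(n_holdout, 0.5)

print(n_train, n_val, n_test)
for n in (n_train, n_val, n_test):
    print(round(n / n_total, 2))
```

Under these assumptions the three parts sum to the full 3214 rows, and the rounded shares match the 0.6, 0.2, 0.2 printed above.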


Another way to split the data into 3 parts with the ratio 3:1:1 (not used in this project; it requires `import numpy as np`):
train, validate, test = np.split(plan.sample(frac=1), [int(.6*len(plan)), int(.8*len(plan))])
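
The one-liner above shuffles without a fixed seed, so the three parts change on every run. A minimal sketch of a reproducible variant is shown here on a small synthetic frame. The helper name `three_way_split`, the seed, and the toy data are assumptions for illustration, not part of the project.

```python
import numpy as np
import pandas as pd


def three_way_split(df, cuts=(0.6, 0.8), seed=12):
    """Shuffle row positions once, then cut them into train/validation/test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(df))
    cut_points = [int(c * len(df)) for c in cuts]
    train_idx, val_idx, test_idx = np.split(order, cut_points)
    return df.iloc[train_idx], df.iloc[val_idx], df.iloc[test_idx]


# Toy stand-in for the real `plan` DataFrame (hypothetical values).
toy = pd.DataFrame({"calls": np.arange(10.0), "is_ultra": [0, 1] * 5})

train, validate, test = three_way_split(toy)
print(len(train), len(validate), len(test))
```

Splitting the shuffled positions instead of the DataFrame itself keeps the original index labels, so each row can be traced back to the source data.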

# Step 3: Investigate the quality of different models by changing hyperparameters

## Type of models that are appropriate with our research goal:
Because we want to predict a categorical variable (whether a customer chooses the **ultra plan** or not), we need to use **Classification** models, including:
- Decision Tree Classifier
- Random Forest Classifier
- Logistic Regression

### Create the DecisionTreeClassifier model
- Create a loop to test different hyperparameters (max_depth): test from 1-10
- Then choose the best model based on accuracy score of the validation set

In [7]:
for depth in range(1, 11):
    model1 = DecisionTreeClassifier(random_state=12, max_depth=depth)
    model1.fit(feature_train, target_train)
    print('max_depth:', depth)
    print('Training set:', model1.score(feature_train, target_train))
    print('Validate set:', model1.score(feature_val, target_val))

max_depth: 1
Training set: 0.7655601659751037
Validate set: 0.7387247278382582
max_depth: 2
Training set: 0.799792531120332
Validate set: 0.7651632970451011
max_depth: 3
Training set: 0.8153526970954357
Validate set: 0.7636080870917574
max_depth: 4
Training set: 0.8257261410788381
Validate set: 0.7682737169517885
max_depth: 5
Training set: 0.8309128630705395
Validate set: 0.7651632970451011
max_depth: 6
Training set: 0.8454356846473029
Validate set: 0.7698289269051322
max_depth: 7
Training set: 0.8630705394190872
Validate set: 0.7682737169517885
max_depth: 8
Training set: 0.870850622406639
Validate set: 0.7682737169517885
max_depth: 9
Training set: 0.8807053941908713
Validate set: 0.7744945567651633
max_depth: 10
Training set: 0.8921161825726142
Validate set: 0.7744945567651633


### DecisionTree model selection: Select the model with max_depth = 4 because its validation accuracy (0.768) is close to the best observed (0.774 at max_depth = 9 and 10), while the difference between training and validation accuracy is still small (not too overfitted)
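
To make the trade-off easier to see, the scores printed by the loop can be tabulated with the train/validation gap next to each depth. This is only a sketch that re-reads the numbers from the output above; it does not refit any model, and the variable names are hypothetical.

```python
# (training accuracy, validation accuracy) per max_depth, copied from the output above
scores = {
    1: (0.7655601659751037, 0.7387247278382582),
    2: (0.799792531120332, 0.7651632970451011),
    3: (0.8153526970954357, 0.7636080870917574),
    4: (0.8257261410788381, 0.7682737169517885),
    5: (0.8309128630705395, 0.7651632970451011),
    6: (0.8454356846473029, 0.7698289269051322),
    7: (0.8630705394190872, 0.7682737169517885),
    8: (0.870850622406639, 0.7682737169517885),
    9: (0.8807053941908713, 0.7744945567651633),
    10: (0.8921161825726142, 0.7744945567651633),
}

# Depth with the highest validation accuracy (ties resolve to the smallest depth).
best_depth = max(scores, key=lambda d: scores[d][1])

for depth, (train_acc, val_acc) in scores.items():
    gap = train_acc - val_acc
    print(f"max_depth={depth:2d}  validation={val_acc:.4f}  gap={gap:.4f}")

print("Highest validation accuracy at max_depth =", best_depth)
```

The gap grows steadily with depth, which is the reason for preferring a shallower tree even though deeper trees score slightly higher on the validation set.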

### Create the RandomForestClassifier model
- Create a loop to test different hyperparameters (n_estimators): test from 10 to 100 in multiples of 10 (10, 20, 30, 40, 50, ..., 100), with max_depth fixed at 10
- Then choose the best model based on accuracy score of the validation set

In [8]:
for estimator in range(10, 101, 10):
    model2 = RandomForestClassifier(random_state=12, max_depth=10,
                                    n_estimators=estimator)
    model2.fit(feature_train, target_train)
    print('n_estimators:', estimator)
    print('Training set:', model2.score(feature_train, target_train))
    print('Validate set:', model2.score(feature_val, target_val))
    

n_estimators: 10
Training set: 0.8941908713692946
Validate set: 0.7807153965785381
n_estimators: 20
Training set: 0.8962655601659751
Validate set: 0.7978227060653188
n_estimators: 30
Training set: 0.8978215767634855
Validate set: 0.7962674961119751
n_estimators: 40
Training set: 0.8983402489626556
Validate set: 0.7900466562986003
n_estimators: 50
Training set: 0.9004149377593361
Validate set: 0.7931570762052877
n_estimators: 60
Training set: 0.8993775933609959
Validate set: 0.7900466562986003
n_estimators: 70
Training set: 0.8978215767634855
Validate set: 0.7916018662519441
n_estimators: 80
Training set: 0.8978215767634855
Validate set: 0.7900466562986003
n_estimators: 90
Training set: 0.8967842323651453
Validate set: 0.7900466562986003
n_estimators: 100
Training set: 0.8973029045643154
Validate set: 0.7900466562986003


### RandomForest model selection: Select the model with n_estimators = 20 because its validation accuracy is the highest (0.798), and it also runs faster than the models with more estimators.

### Create the LogisticRegression model
No need to tune the hyperparameter(s)

In [9]:
model3 = LogisticRegression(random_state=12, solver = 'liblinear')
model3.fit(feature_train, target_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=12, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [10]:
print('Accuracy')
print('Training set: ', model3.score(feature_train, target_train))
print('Validate set: ', model3.score(feature_val, target_val))

Accuracy
Training set:  0.716804979253112
Validate set:  0.6780715396578538


# Step 4: Check the quality of the model using the test set

### Compare the three models by comparing the accuracy score of the test set

Define the selected models with their corresponding hyperparameters

In [11]:
model1 = DecisionTreeClassifier(random_state=12, max_depth=4)
model2 = RandomForestClassifier(random_state=12, max_depth=10, n_estimators=20)
model3 = LogisticRegression(random_state=12, solver = 'liblinear')                                    

Fit the 3 models to our training data

In [12]:
model1.fit(feature_train, target_train)
model2.fit(feature_train, target_train)
model3.fit(feature_train, target_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=12, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Get predictions using 3 models

In [13]:
prediction_test_decisiontree = model1.predict(feature_test)

prediction_test_randomforest = model2.predict(feature_test)

prediction_test_logisticregression = model3.predict(feature_test)

Calculate Accuracy score using test set

In [14]:
print("Decision Tree Model", ": ", end="")
print(accuracy_score(target_test, prediction_test_decisiontree))
print("Random Forest Model", ": ", end="")
print(accuracy_score(target_test, prediction_test_randomforest))
print("Logistic Regression Model", ": ", end="")
print(accuracy_score(target_test, prediction_test_logisticregression))

Decision Tree Model : 0.7682737169517885
Random Forest Model : 0.7822706065318819
Logistic Regression Model : 0.6594090202177294
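
For reference, the accuracy reported here is simply the share of test observations whose predicted label equals the true label. A minimal sketch of that calculation follows. The function name `manual_accuracy` and the example labels are hypothetical and not taken from the Megaline data.

```python
def manual_accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Hypothetical labels for illustration only.
true_labels = [0, 0, 1, 1, 0]
predicted = [0, 1, 1, 1, 0]

# Four of the five predictions match the true labels.
print(manual_accuracy(true_labels, predicted))
```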


Check for overfitting by comparing the accuracy score between the training and test sets for each model

In [15]:
print('Decision Tree Model')
print('Training set:', model1.score(feature_train, target_train))
print('Test set:', model1.score(feature_test, target_test))

print('Random Forest Model')
print('Training set:', model2.score(feature_train, target_train))
print('Test set:', model2.score(feature_test, target_test))

print('Logistic Regression Model')
print('Training set:', model3.score(feature_train, target_train))
print('Test set:', model3.score(feature_test, target_test))

Decision Tree Model
Training set: 0.8257261410788381
Test set: 0.7682737169517885
Random Forest Model
Training set: 0.8962655601659751
Test set: 0.7822706065318819
Logistic Regression Model
Training set: 0.716804979253112
Test set: 0.6594090202177294


# Best model: Random Forest Model
- Pro: Its test accuracy is the highest (0.782), and it satisfies the accuracy threshold of 0.75
- Con: This model is also the most overfitted (training accuracy 0.896 vs. test accuracy 0.782). However, its higher test accuracy might compensate for this overfitting
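
The overfitting comparison can be made explicit by computing the gap between training and test accuracy for each model. The sketch below only re-uses the scores printed above; the dictionary and variable names are hypothetical.

```python
# (training accuracy, test accuracy) per model, copied from the output above
reported = {
    "Decision Tree": (0.8257261410788381, 0.7682737169517885),
    "Random Forest": (0.8962655601659751, 0.7822706065318819),
    "Logistic Regression": (0.716804979253112, 0.6594090202177294),
}

# A larger gap means the model fits the training data much better than unseen data.
gaps = {name: train_acc - test_acc for name, (train_acc, test_acc) in reported.items()}

for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.4f}")

most_overfitted = max(gaps, key=gaps.get)
best_on_test = max(reported, key=lambda name: reported[name][1])
print("Largest gap:", most_overfitted)
print("Highest test accuracy:", best_on_test)
```

With these numbers the Random Forest has roughly twice the gap of the other two models, yet it is also the only one whose test accuracy is clearly above the others.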

# Conclusion
- Because we want to predict a categorical variable (whether a customer chooses the **ultra plan** or not), we need to use **Classification** models, including: Decision Tree Classifier, Random Forest Classifier, and Logistic Regression
- The source data was split into 3 sets: train, validation and test, where:
    - Train set: used to train the models
    - Validation set: used to tune hyperparameters for each model
    - Test set: used to check the quality of the models
- Factors involved in model selection:
    - Accuracy score
    - Run speed
    - The trade-off between accuracy and overfitting/underfitting 
- After considering the above factors, the Random Forest Model was chosen
- Caveat: There is some bias in the selection of hyperparameters. For example, for the Decision Tree only max_depth values from 1 to 10 were tested, and for the Random Forest max_depth was fixed at 10 rather than testing combinations of max_depth and n_estimators --> A better model selection method is needed
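
One common option for the better selection method mentioned in the caveat, not used in this project, is an exhaustive grid search over hyperparameter combinations with scikit-learn's `GridSearchCV`. The sketch below is only an illustration under stated assumptions: it runs on synthetic data from `make_classification` rather than the Megaline file, and the grid values and `cv=3` are arbitrary choices. Note that cross-validation on the training part takes the place of the separate validation set here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the four Megaline features and the binary target.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12)

# Every combination of max_depth and n_estimators is evaluated (assumed grid).
param_grid = {"max_depth": [2, 4, 6], "n_estimators": [10, 20]}

search = GridSearchCV(RandomForestClassifier(random_state=12),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", search.best_score_)
# The held-out test set is used once, for the final check only.
test_accuracy = search.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

Because the search refits the best combination on the whole training part by default, `search` can be used directly for prediction afterwards.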