# Project Description

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra. 

You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you've already performed the data preprocessing step, you can move straight to creating the model.

Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.  

## Exploratory Data Analysis

### Importing

#### Importing Libraries

In [1]:
from IPython.display import Image, display
import pandas as pd

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as st
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score, roc_auc_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
%matplotlib inline

#### Import and Review Data

In [2]:
df = pd.read_csv('/datasets/users_behavior.csv')
display(df.head(5))
df.shape

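|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |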


(3214, 5)

<div class="alert alert-info"> <b> Data Review </b>:
    <li> There are 5 columns and 3214 rows in this data.</li>
    <li> We do not have any missing data in any of the columns </li>

In [3]:
# Check the data for duplicate rows

print(df.duplicated().sum())

0


In [4]:
display(df.describe())

|       | calls | minutes | messages | mb_used | is_ultra |
|-------|-------|---------|----------|---------|----------|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


<div class="alert alert-info"> <b> Analysis Review </b>:
    <li> The data set contains 3214 observations.</li>
    <li> There are no missing values in any of the columns and no duplicate rows.</li>
    <li> There are 4 numeric columns and 1 categorical column. is_ultra is stored as a number, but the value only flags whether a customer is an Ultra plan user or not. It does not measure any quantity.</li>
    <li> Variables<ul>
        <li> Calls <ul>
            <li> Minimum call value is 0 and Maximum call value is 244 </li>
            <li> Median call value is 62 and Average call value is 63.04 </li>
            </ul>
        <li> Minutes<ul>
            <li> Minimum minutes value is 0 and Maximum minutes value is 1632.06</li>
            <li> Median minutes value is 430.6 and Average minutes value is 438.21</li>
            </ul>
        <li> Messages<ul>
            <li> Minimum messages value is 0 and Maximum messages value is 224</li>
            <li> Median messages value is 30 and Average messages value is 38.28</li>
            </ul>
        <li> MB Used<ul>
            <li> Minimum mb used value is 0 and Maximum mb used value is 49745.73</li>
            <li> Median mb used value is 16943.24 and Average mb used value is 17207.67</li>
            </ul>
        <li> Ultra vs Smart <ul>
            <li> There are 985 Ultra users and 2229 Smart plan users. The mean of is_ultra is 0.306, so about 30.6% of subscribers are on Ultra. </li>
            </ul>
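
The class counts can be read directly from the target column rather than inferred from `describe()`. The snippet below is a minimal sketch on a small, made-up `is_ultra` series (the values are hypothetical, not the real data); it relies on the fact that the mean of a 0/1 column equals the share of ones.

```python
import pandas as pd

# Toy stand-in for the `is_ultra` column (hypothetical values, not the real data)
is_ultra = pd.Series([0, 0, 1, 0, 1, 0, 0, 0, 1, 0], name="is_ultra")

counts = is_ultra.value_counts()                # number of rows per class
shares = is_ultra.value_counts(normalize=True)  # fraction of rows per class

print(counts.to_dict())
print(shares.round(2).to_dict())

# For a 0/1 column, the mean is the share of rows equal to 1
print(is_ultra.mean())
```

Applied to the real data, the same two calls would show how imbalanced the Smart and Ultra classes are before any model is trained.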

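## Splitting the data set
Split the data into training (60%), validation (20%), and test (20%) sets. The split is done in two steps: 20% is held out for testing first, then 25% of the remaining 80% (20% of the total) becomes the validation set.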

In [5]:
x = df.drop(['is_ultra'], axis=1)
y = df['is_ultra']

In [6]:
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=4321)

x_train, x_val, y_train, y_val = train_test_split(
    x_train, y_train, test_size=0.25, random_state=4321)

In [7]:
# Confirm the sizes of the split data sets
print('x_train size:', x_train.shape)
print('x_test size:', x_test.shape)
print('x_val size:', x_val.shape)

x_train size: (1928, 4)
x_test size: (643, 4)
x_val size: (643, 4)
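
Because the classes are imbalanced, one option worth considering is a stratified split, which keeps the class ratio roughly equal across the three sets. The sketch below uses synthetic data (an assumption, not the subscriber data) and the `stratify` parameter of `train_test_split`; it is an alternative to the plain random split used above, not a description of it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 4 features, 30% positives (hypothetical data)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([1] * 300 + [0] * 700)

# Step 1: hold out 20% for testing, preserving the class ratio
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=4321)

# Step 2: 25% of the remaining 80% is 20% of the total, used for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=4321)

print(len(X_train), len(X_val), len(X_test))
print(y_train.mean(), y_val.mean(), y_test.mean())
```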


## Training Models

### Decision Tree Model and Evaluation

In [8]:
tree = DecisionTreeClassifier(random_state=4321)
tree.fit(x_train, y_train)
tree.score(x_val, y_val)

0.7153965785381027

#### Use hyperparameter tuning to improve the model

In [9]:
best_depth = 0
best_val_accuracy = 0
for i in range(1, 21):
    tree = DecisionTreeClassifier(random_state=4321, max_depth=i)
    tree.fit(x_train, y_train)
    val_accuracy=tree.score(x_val,y_val)
    if val_accuracy > best_val_accuracy:
        best_val_accuracy = val_accuracy
        best_depth=i
print("Best depth: ",best_depth)

Best depth:  8


In [10]:
tr_final = DecisionTreeClassifier(max_depth=best_depth, random_state=4321)
tr_final.fit(x_train,y_train)
tr_acc = tr_final.score(x_test, y_test)
print("Test Accuracy of Decision Tree Model is: ", tr_acc)

Test Accuracy of Decision Tree Model is:  0.7962674961119751


<div class="alert alert-info"> <b> Decision Tree Classifier Review </b>:
    <li> The best max depth for the decision tree, chosen on the validation set, is 8. It gives a test accuracy of 79.63%</li>
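
To report the validation accuracy that drove the choice of depth alongside the test accuracy, the depth search can store every score instead of only the running best. This is a minimal sketch on synthetic data from `make_classification` (an assumption standing in for the real features); the dictionary name `val_scores` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the subscriber data (hypothetical, 4 features)
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=4321)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=4321)

# Record the validation accuracy for every candidate depth
val_scores = {}
for depth in range(1, 21):
    model = DecisionTreeClassifier(max_depth=depth, random_state=4321)
    model.fit(X_train, y_train)
    val_scores[depth] = model.score(X_val, y_val)

# max() returns the first (shallowest) depth that reaches the top score
best_depth = max(val_scores, key=val_scores.get)
print("Best depth:", best_depth)
print("Validation accuracy at best depth:", val_scores[best_depth])
```

Keeping all scores also makes it easy to plot accuracy against depth and see where the tree starts to overfit.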

### Random Forest Model and Evaluation

In [11]:
rf = RandomForestClassifier(random_state=4321)
rf.fit(x_train, y_train)
rf.score(x_val, y_val)

0.7713841368584758

#### Improve the model

In [12]:
# Hyperparameter tuning using GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 150],  # Number of trees in the forest
    'max_depth': [None, 5, 10, 15]  # Maximum depth of the trees
}

rf = RandomForestClassifier(random_state=4321)
grid_search = GridSearchCV(rf, param_grid, cv=None, scoring='accuracy')
grid_search.fit(pd.concat([x_train, x_val]), pd.concat([y_train, y_val]))

# Get the best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

Best Parameters: {'max_depth': 10, 'n_estimators': 150}


In [13]:
rf_final = RandomForestClassifier(random_state=4321,n_estimators=150,max_depth=10)
rf_final.fit(x_train,y_train)
# Note: this score is computed on the validation set, not the test set
rf_final_acc=rf_final.score(x_val,y_val)
print("Validation Accuracy of Random Forest Model is: ", rf_final_acc)

Validation Accuracy of Random Forest Model is:  0.7822706065318819


<div class="alert alert-info"> <b> Random Forest Classifier Review </b>:
    <li> The best parameters for the Random Forest Classifier are max_depth=10 and n_estimators=150. This yields a validation accuracy of 78.23%</li>
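
For a figure that is directly comparable with the other models, the tuned forest can be scored on the held-out test set. The sketch below assumes synthetic data and a deliberately small grid to keep it fast; it uses `best_estimator_`, which `GridSearchCV` refits on all the data passed to `fit` when `refit=True` (the default).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the subscriber data (hypothetical, 4 features)
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=1, random_state=4321)

# X_dev plays the role of train + validation; X_test is never seen by the search
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4321)

# Small illustrative grid (not the grid used in the notebook)
param_grid = {"n_estimators": [10, 30], "max_depth": [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=4321),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X_dev, y_dev)

# best_estimator_ is already refit on all of X_dev with the best parameters
test_acc = search.best_estimator_.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("Test accuracy:", test_acc)
```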

### Logistic Regression Model and Evaluation

In [14]:
lg = LogisticRegression(random_state=4321,solver='liblinear')
lg.fit(x_train,y_train)
lg_train_score=lg.score(x_train,y_train)
lg_val_score = lg.score(x_val,y_val)

print("Accuracy of the logistic regression model on the training set: ", lg_train_score)
print("Accuracy of the logistic regression model on the validation set: ", lg_val_score)

Accuracy of the logistic regression model on the training set:  0.7152489626556017
Accuracy of the logistic regression model on the validation set:  0.713841368584759


#### Improve the model

In [15]:
# Hyperparameter tuning using GridSearchCV
param_grid = {
    'C': [0.01, 0.1, 1, 10],  # Inverse of regularization strength
    'solver': ['liblinear', 'lbfgs']  # Algorithm to use in the optimization problem
}

grid_search = GridSearchCV(lg, param_grid, cv=None, scoring='accuracy')
grid_search.fit(pd.concat([x_train, x_val]), pd.concat([y_train, y_val]))

# Get the best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Best Parameters: {'C': 0.01, 'solver': 'lbfgs'}


In [16]:
lg_final=LogisticRegression(**best_params,random_state=4321)
lg_final.fit(pd.concat([x_train, x_val]), pd.concat([y_train, y_val]))

lg_final_acc=lg_final.score(x_test,y_test)
print("Test Accuracy of Logistic Regression Model is: ", lg_final_acc)

Test Accuracy of Logistic Regression Model is:  0.71850699844479


<div class="alert alert-info"> <b> Logistic Regression Model Review </b>:
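    <li> The best parameters for the Logistic Regression Model are {'C': 0.01, 'solver': 'lbfgs'}. This yields a test accuracy of 71.85%. The lbfgs solver reported a convergence warning during the search, so scaling the features or raising max_iter may change this result.</li>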
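
The convergence warning above suggests scaling the data. One way to do that without leaking validation folds into the scaler is to put `StandardScaler` and `LogisticRegression` into a single pipeline and tune that. This is a sketch under assumptions: the data is synthetic, rescaled by made-up factors to mimic features on very different scales, and the step name `logisticregression` is the one `make_pipeline` generates automatically.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with features on very different scales (hypothetical)
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=4321)
X = X * np.array([50.0, 200.0, 40.0, 8000.0])

X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4321)

# The scaler is fit inside each CV fold, so held-out folds stay unseen
pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(solver="lbfgs", random_state=4321))

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_dev, y_dev)

test_acc = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("Test accuracy:", test_acc)
```

Whether scaling would lift the accuracy on the real data is not something this sketch can show; it only illustrates the mechanics.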

<div class="alert alert-info"> <b> Model Comparison and model decision </b>:
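    <li> Based on the accuracy of the models, the Decision Tree yielded the best result with a test accuracy of 79.63%. Note that the Random Forest figure (78.23%) was measured on the validation set, so the comparison is not strictly like for like.</li>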

## Sanity Check

<div class="alert alert-info"> <b> Is the model doing anything? </b>:
    <li> Always predicting the majority plan (Smart) would be correct 69.35% of the time based on the data (2229/(2229+985)).</li>
    <li> Hence, if a model's accuracy is lower than 69.35%, it is doing no better than this naive guess. </li> 
    <li> In our case, all 3 models yielded accuracy higher than 69.35%, so they are doing better than guesswork <ul>
        <li> out of the 3 models we have created, 
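
The majority-class baseline described above can also be computed with scikit-learn's `DummyClassifier`. The sketch below rebuilds a target vector from the class counts only (2229 Smart, 985 Ultra, as implied by the mean of `is_ultra`); the feature matrix is a placeholder because this baseline ignores features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Target rebuilt from class counts: 2229 Smart (0) and 985 Ultra (1)
y = np.array([0] * 2229 + [1] * 985)
X = np.zeros((len(y), 1))  # placeholder features, ignored by the baseline

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

baseline_acc = baseline.score(X, y)
print(round(baseline_acc, 4))
```

Any trained model should beat this number on the test set to be considered useful.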

<div class="alert alert-block alert-success">✔️
    

__Reviewer's comment №2__


Otherwise it's great 😊. Your project is begging for GitHub =)   
    
Congratulations on the successful completion of the project 😊👍
And I wish you success in new works 😊