# Contents

- Project Description
- Data Description
- Project Instruction:
  - Open the dataset and split it into a training set, a validation set, and a test set.
  - Check the quality of different models by changing their hyperparameters.
  - Check the quality of the model using the test set.
  - Perform a sanity check on the model.
- Conclusion

## Project Description

The mobile operator Megaline is dissatisfied that many of its customers still use legacy plans. The company wants a model that analyzes customer behaviour and recommends one of Megaline's two newest plans: Smart or Ultra.

Goals:
- Develop a model that analyzes customer behaviour and recommends a plan, with:
  - the highest possible accuracy
  - an accuracy of at least 0.75
  - the accuracy verified on the test dataset

## Data Description

- Each row of the dataset describes one user's behaviour for one month:
  - calls    : number of calls
  - minutes  : total call duration ( in minutes )
  - messages : number of text messages
  - mb_used  : internet traffic used ( in MB )
  - is_ultra : plan for the current month ( Ultra = 1, Smart = 0 )
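
The column list above can be turned into a quick schema check before modelling. This is a minimal sketch: `EXPECTED_COLUMNS` and `check_schema` are hypothetical names introduced here for illustration, and the sample rows are copied from the dataset preview rather than loaded from the CSV file.

```python
import pandas as pd

# Column names as they appear in the dataset preview (assumed to be the full schema)
EXPECTED_COLUMNS = ["calls", "minutes", "messages", "mb_used", "is_ultra"]


def check_schema(frame):
    """Raise ValueError if a column is missing or is_ultra holds values other than 0/1."""
    missing = [col for col in EXPECTED_COLUMNS if col not in frame.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if not frame["is_ultra"].isin([0, 1]).all():
        raise ValueError("is_ultra must contain only 0 (Smart) or 1 (Ultra)")
    return True


# A few rows copied from the dataset preview
sample = pd.DataFrame(
    [
        [40.0, 311.9, 83.0, 19915.42, 0],
        [85.0, 516.75, 56.0, 22696.96, 0],
        [106.0, 745.53, 81.0, 8437.39, 1],
    ],
    columns=EXPECTED_COLUMNS,
)
print(check_schema(sample))
```

Running the same check on the full DataFrame right after `pd.read_csv` would catch a renamed or missing column early.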

## Project Instruction

File path: /datasets/users_behavior.csv

Note: the data is already preprocessed and ready for analysis.


### Open the dataset

In [1]:
import pandas as pd
import numpy as np

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score

#open the dataset
df = pd.read_csv('/datasets/users_behavior.csv')

In [2]:
#check dataset
df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


In [3]:
#checking dataset missing value
df.isnull().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

In [4]:
#checking dataset info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


### Separate the dataset into training set, validation set, and test set data

- training + validation --> used to develop the model and tune its hyperparameters
  - the result is the final model
- test set --> used to test the final model


In [5]:
#separate it in to the training set, validation set, and test set

train_valid, test = train_test_split(df, test_size = 0.2)
train, valid = train_test_split(train_valid, test_size = 0.25)

#train
features_train = train.drop(['is_ultra'], axis=1)
target_train = train['is_ultra']

#validation
features_valid = valid.drop(['is_ultra'], axis=1)
target_valid = valid['is_ultra']

#test
features_test = test.drop(['is_ultra'], axis=1)
target_test = test['is_ultra']

#check features shape
print(features_train.shape)
print(features_valid.shape)
print(features_test.shape)

(1928, 4)
(643, 4)
(643, 4)


In [6]:
#check dataset shape
df.shape

(3214, 5)

- note
  - after the split :
    - the three feature sets add up to the full dataset ( 1928 + 643 + 643 = 3214 rows )
    - the split sizes are therefore correct ( roughly 60% / 20% / 20% )
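
The split sizes can be reproduced on a placeholder frame with the same number of rows. The sketch below is an illustration under assumptions: the data are random stand-ins (not the Megaline dataset), and `SEED` is an arbitrary value added here so that the split is repeatable, which the notebook's own split is not.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as users_behavior.csv (3214 rows, 5 columns).
# The values are random placeholders, not the real data.
rng = np.random.default_rng(0)
n_rows = 3214
df_demo = pd.DataFrame({
    "calls": rng.integers(0, 200, n_rows).astype(float),
    "minutes": rng.uniform(0, 1500, n_rows),
    "messages": rng.integers(0, 200, n_rows).astype(float),
    "mb_used": rng.uniform(0, 40000, n_rows),
    "is_ultra": rng.integers(0, 2, n_rows),
})

SEED = 12345  # arbitrary; any fixed integer makes the split repeatable

# 20% test, then 25% of the remaining 80% as validation (0.25 * 0.8 = 0.2 overall)
train_valid, test = train_test_split(df_demo, test_size=0.2, random_state=SEED)
train, valid = train_test_split(train_valid, test_size=0.25, random_state=SEED)

sizes = (len(train), len(valid), len(test))
print(sizes)

# The three parts must add up to the full dataset
assert sum(sizes) == len(df_demo)
```

Passing `random_state` to both calls keeps the three sets identical between runs, so accuracy scores from different cells stay comparable.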

In [7]:
#check features_train dataset
features_train.head()

Unnamed: 0,calls,minutes,messages,mb_used
1463,7.0,22.96,7.0,12762.99
1530,10.0,94.06,8.0,621.63
1402,87.0,637.26,81.0,12511.37
2982,50.0,362.53,0.0,18476.21
1905,50.0,386.2,20.0,17963.49


In [8]:
#check features_valid dataset
features_valid.head()

Unnamed: 0,calls,minutes,messages,mb_used
2806,107.0,733.45,98.0,10756.78
1019,77.0,530.07,69.0,12697.45
3002,35.0,213.87,15.0,20097.61
1321,56.0,302.74,70.0,29223.17
46,76.0,535.91,65.0,11968.22


In [9]:
#check features_test dataset
features_test.head()

Unnamed: 0,calls,minutes,messages,mb_used
1346,49.0,310.78,3.0,11506.66
1284,29.0,192.29,73.0,14620.52
3057,59.0,381.49,47.0,17272.81
827,54.0,448.87,0.0,9670.72
1195,66.0,462.58,100.0,11846.31


### Check the quality of different models by changing their hyperparameters.

#### Models with default hyperparameters ( no tuning )

In [10]:
#Decision tree classifier without hyperparameter tuning

dtree = DecisionTreeClassifier()
dtree.fit(features_train, target_train)

y_prediction_valid = dtree.predict(features_valid)
y_prediction_test = dtree.predict(features_test)

print('Decision tree classifier without hyperparameter tuning')
print('validation accuracy score : ', accuracy_score(target_valid, y_prediction_valid) * 100)
print('testing accuracy score    : ', accuracy_score(target_test, y_prediction_test) * 100)

Decision tree classifier without hyperparameter tuning
validation accuracy score :  71.22861586314151
testing accuracy score    :  73.71695178849144


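- Findings :
  - validation and test accuracy are close ( 71.2 vs 73.7, about 2.5 percentage points apart )
  - both scores are below the 0.75 target, so the default tree is not good enough
  - training accuracy is not printed here, so overfitting cannot be ruled out from these two numbers alone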

In [11]:
#Logistic regression without hyperparameter tuning

log_reg = LogisticRegression()
log_reg.fit(features_train, target_train)

y_prediction_train_2 = log_reg.predict(features_train)
y_prediction_valid_2 = log_reg.predict(features_valid)

print('Logistic regression without hyperparameter tuning')
print('training accuracy score   : ', accuracy_score(target_train, y_prediction_train_2) * 100)
print('validation accuracy score : ', accuracy_score(target_valid, y_prediction_valid_2) * 100)

Logistic regression without hyperparameter tuning
training accuracy score   :  74.79253112033194
validation accuracy score :  76.82737169517885


- Findings :
  - training and validation accuracy are close ( 74.8 vs 76.8, about 2 percentage points apart )
  - validation accuracy is not lower than training accuracy, so the model is not overfitting

In [12]:
#Random forest classifier without hyperparameter tuning
rf = RandomForestClassifier()
rf.fit(features_train, target_train)

y_prediction_valid_3 = rf.predict(features_valid)
y_prediction_test_3 = rf.predict(features_test)

print('Random forest classifier without hyperparameter tuning')
print('validation accuracy score : ', accuracy_score(target_valid, y_prediction_valid_3) * 100)
print('testing accuracy score    : ', accuracy_score(target_test, y_prediction_test_3) * 100)

Random forest classifier without hyperparameter tuning
validation accuracy score :  78.0715396578538
testing accuracy score    :  81.49300155520996


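- Findings :
  - validation and test accuracy are fairly close ( 78.1 vs 81.5, about 3.4 percentage points apart )
  - both scores exceed the 0.75 target
  - test accuracy is not lower than validation accuracy, so the model generalizes to unseen data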
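
Judging overfitting is easier when every model reports training, validation, and test accuracy side by side. The helper below is a sketch, not part of the original notebook: `accuracy_report` is a hypothetical name, and the data come from `make_classification` as a placeholder for the Megaline features.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def accuracy_report(model, splits):
    """Return the accuracy (in %) of a fitted model on each named (features, target) split."""
    return {
        name: accuracy_score(target, model.predict(features)) * 100
        for name, (features, target) in splits.items()
    }


# Synthetic placeholder data with 4 features (not the Megaline dataset)
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_tv, y_tv, test_size=0.25, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
report = accuracy_report(
    model,
    {"train": (X_train, y_train), "valid": (X_valid, y_valid), "test": (X_test, y_test)},
)
for name, acc in report.items():
    print(f"{name:<5} accuracy: {acc:.2f}")
```

A training score far above the validation and test scores is the usual sign of overfitting; similar scores across all three splits suggest the model generalizes.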

#### Decision tree classifier with hyperparameter tuning

In [13]:
#search for the best max_depth value
max_dept_list = [1,2,3,4,5,6,7,8,9]
for md in max_dept_list:
    dtree = DecisionTreeClassifier(max_depth=md)
    dtree.fit(features_train, target_train)

    y_prediction_valid = dtree.predict(features_valid)
    y_prediction_train = dtree.predict(features_train)

    acc_train = accuracy_score (target_train, y_prediction_train)*100
    acc_valid = accuracy_score (target_valid, y_prediction_valid)*100

    print(f'for max depth {md} ')
    print(f'   the training accuration value is : {acc_train} and validation accuration value is : {acc_valid} ')
    print('')


for max depth 1 
   the training accuration value is : 75.15560165975104 and validation accuration value is : 75.11664074650078 

for max depth 2 
   the training accuration value is : 77.69709543568464 and validation accuration value is : 76.98289269051321 

for max depth 3 
   the training accuration value is : 79.51244813278008 and validation accuration value is : 78.22706065318819 

for max depth 4 
   the training accuration value is : 81.32780082987551 and validation accuration value is : 78.38258164852256 

for max depth 5 
   the training accuration value is : 82.4688796680498 and validation accuration value is : 78.53810264385692 

for max depth 6 
   the training accuration value is : 83.92116182572614 and validation accuration value is : 79.16018662519441 

for max depth 7 
   the training accuration value is : 84.64730290456431 and validation accuration value is : 78.69362363919129 

for max depth 8 
   the training accuration value is : 85.4253112033195 and validation accu

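findings :
- at max depth 3 the training and validation accuracy are still close ( 79.51 vs 78.23 in the output above, a gap of about 1.3 percentage points )
- at greater depths the training accuracy keeps rising while the validation accuracy stays between about 78.4 and 79.2, so the gap widens
- max depth 3 is chosen as the best_max_depth value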
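
The trade-off behind this choice can be checked directly from the printed scores. The sketch below copies the training and validation accuracies for depths 1 to 7 from the output above (depths 8 and 9 are omitted because that output is cut off) and computes the gap for each depth. It is an illustration of one way to compare depths, not the notebook's own selection code.

```python
# (training accuracy, validation accuracy) per max_depth, copied from the output above.
# Depths 8 and 9 are left out because their output is truncated.
scores = {
    1: (75.15560165975104, 75.11664074650078),
    2: (77.69709543568464, 76.98289269051321),
    3: (79.51244813278008, 78.22706065318819),
    4: (81.32780082987551, 78.38258164852256),
    5: (82.4688796680498, 78.53810264385692),
    6: (83.92116182572614, 79.16018662519441),
    7: (84.64730290456431, 78.69362363919129),
}

# Gap between training and validation accuracy: a growing gap hints at overfitting
gaps = {depth: train - valid for depth, (train, valid) in scores.items()}

# Depth with the highest validation accuracy among the depths listed
best_by_validation = max(scores, key=lambda depth: scores[depth][1])

for depth in scores:
    print(f"max_depth={depth}: validation={scores[depth][1]:.2f}, gap={gaps[depth]:.2f}")
print("highest validation accuracy at max_depth =", best_by_validation)
```

Among the depths listed, validation accuracy alone would favour a deeper tree, while the gap grows with every extra level. Choosing depth 3 trades a little validation accuracy for a smaller gap.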

In [14]:
#fit the tree with the chosen max_depth
best_max_depth = 3
dtree = DecisionTreeClassifier(max_depth=best_max_depth)
dtree.fit(features_train, target_train)

y_prediction_train_2 = dtree.predict(features_train)
y_prediction_valid_2 = dtree.predict(features_valid)
y_prediction_test_2 = dtree.predict(features_test)

print(accuracy_score (target_train, y_prediction_train_2)*100)
print(accuracy_score (target_valid, y_prediction_valid_2)*100)
print(accuracy_score (target_test, y_prediction_test_2)*100)

79.51244813278008
78.22706065318819
79.78227060653188


Decision tree classifier findings :

- with max_depth = 3, hyperparameter tuning improves the model ( values from the outputs shown above ) :
  - validation accuracy rises from 71.23 to 78.23
  - test accuracy rises from 73.72 to 79.78

#### Random forest classifier with hyperparameter tuning

In [19]:
#search for the best max_depth and n_estimators values
max_dept_list = [1,2,3,4,5,6,7,8,9]
n_estimator_list = [100,200,300,400,500]
for md in max_dept_list:
    for nest in n_estimator_list:
        rf = RandomForestClassifier(max_depth=md, n_estimators=nest)
        rf.fit(features_train, target_train)

        y_prediction_valid = rf.predict(features_valid)
        
        acc_valid = accuracy_score (target_valid, y_prediction_valid)*100

        print(f'for max depth {md} and estimator {nest}')
        print(f'   the validation accuration value is : {acc_valid} ')
        print('')


for max depth 1 and estimator 100
   the validation accuration value is : 75.58320373250389 

for max depth 1 and estimator 200
   the validation accuration value is : 75.58320373250389 

for max depth 1 and estimator 300
   the validation accuration value is : 75.27216174183515 

for max depth 1 and estimator 400
   the validation accuration value is : 75.58320373250389 

for max depth 1 and estimator 500
   the validation accuration value is : 75.89424572317263 

for max depth 2 and estimator 100
   the validation accuration value is : 78.69362363919129 

for max depth 2 and estimator 200
   the validation accuration value is : 78.38258164852256 

for max depth 2 and estimator 300
   the validation accuration value is : 78.84914463452566 

for max depth 2 and estimator 400
   the validation accuration value is : 79.00466562986003 

for max depth 2 and estimator 500
   the validation accuration value is : 78.84914463452566 

for max depth 3 and estimator 100
   the validation accurati

In [16]:
max_depth_best = 9
n_estimators_best = 200

rf = RandomForestClassifier (max_depth=max_depth_best,n_estimators=n_estimators_best)
rf.fit(features_train, target_train)

y_prediction_valid_2 = rf.predict(features_valid)
y_prediction_test_2 = rf.predict(features_test)

print(accuracy_score (target_valid, y_prediction_valid_2)*100)
print(accuracy_score (target_test, y_prediction_test_2)*100)

79.78227060653188
81.33748055987559


Random forest classifier findings :

- the grid search above, which scores each combination on the validation set, led to max_depth_best = 9 and n_estimators_best = 200
- effect of hyperparameter tuning ( values from the outputs shown above ) :
  - validation accuracy rises from 78.07 to 79.78
  - test accuracy is essentially unchanged ( 81.49 before, 81.34 after )
  - the tuning result changes slightly from run to run, because neither the split nor the forest uses a fixed random_state
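
The run-to-run variation can be removed by fixing `random_state` in both the split and the forest. The sketch below shows the idea on placeholder data from `make_classification`; the function name `validation_accuracy`, the seed value, and the reduced `n_estimators=50` are assumptions chosen to keep the example small, not settings from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def validation_accuracy(seed):
    """Fit a small forest with every source of randomness tied to `seed`."""
    # Synthetic placeholder data with 4 features (not the Megaline dataset)
    X, y = make_classification(
        n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=0
    )
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    forest = RandomForestClassifier(n_estimators=50, max_depth=9, random_state=seed)
    forest.fit(X_train, y_train)
    return accuracy_score(y_valid, forest.predict(X_valid)) * 100


# Two runs with the same seed give the same score
first_run = validation_accuracy(seed=12345)
second_run = validation_accuracy(seed=12345)
print(first_run, second_run)
```

With a fixed seed, differences between hyperparameter settings reflect the settings themselves rather than a different random split or a different set of trees.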

### Perform a sanity check on the model

In [17]:
#checking is_ultra column
df['is_ultra'].value_counts()

0    2229
1     985
Name: is_ultra, dtype: int64

In [18]:
# percentage (%) of is_ultra data
df['is_ultra'].value_counts()/df.shape[0]*100

0    69.352831
1    30.647169
Name: is_ultra, dtype: float64

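- finding :
  - the classes are imbalanced : '0' ( Smart ) dominates with about 69% of the rows
  - a model that always predicts '0' would therefore already reach about 69% accuracy
  - a useful model has to beat this baseline; the tuned models above ( about 80% and 81% test accuracy ) do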
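
A common sanity check is to compare the models against a constant prediction of the majority class. The sketch below computes that baseline from the class counts printed above and compares it with the test accuracies of the two tuned models. The baseline uses the whole dataset, so the class ratio in the test split alone may differ slightly.

```python
# Class counts copied from the value_counts() output above
class_counts = {0: 2229, 1: 985}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a model that always predicts the majority class
baseline_accuracy = class_counts[majority_class] / total * 100
print(f"always predicting {majority_class} scores {baseline_accuracy:.2f}% accuracy")

# Test accuracies of the tuned models, copied from the cells above
tuned_test_accuracy = {
    "decision tree": 79.78227060653188,
    "random forest": 81.33748055987559,
}
for name, acc in tuned_test_accuracy.items():
    print(f"{name}: {acc - baseline_accuracy:+.2f} points vs. baseline")
```

A model whose accuracy is close to the baseline has learned little beyond the class ratio, so the margin over the baseline is more informative than the raw accuracy.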

## Conclusion

Based on the Megaline customer dataset, the main points are :

- Findings :
  - customers on the Smart plan dominate ( about 70% of all rows )
  - before model fitting, the dataset was split into a training set, a validation set, and a test set
    - the three sets add up to the full dataset ( 1928 + 643 + 643 = 3214 )
    - validation and test accuracy scores are close to each other for every model
    - the tuned models do not appear to overfit
  - Decision tree classifier findings :
    - with max_depth = 3, hyperparameter tuning improves the model ( values from the outputs shown above ) :
      - validation accuracy rises from 71.23 to 78.23
      - test accuracy rises from 73.72 to 79.78
  - Random forest classifier findings :
    - the grid search on validation accuracy led to max_depth_best = 9 and n_estimators_best = 200
    - effect of hyperparameter tuning ( values from the outputs shown above ) :
      - validation accuracy rises from 78.07 to 79.78
      - test accuracy is essentially unchanged ( 81.49 before, 81.34 after )
      - the tuning result changes slightly from run to run, because neither the split nor the forest uses a fixed random_state

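- Insight
  - the dataset is imbalanced, which can bias a model toward the majority class ( Smart )
  - always predicting Smart would already give about 69% accuracy, so accuracy has to be read against that baseline
  - the tuned models do not appear to overfit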

- Recommendation
  - the random forest classifier with max_depth_best = 9 and n_estimators_best = 200 is the best of the models tested ( test accuracy of about 81%, above the 0.75 target ) and is recommended to help Megaline understand how customers choose their plan