# Exploratory Data Analysis

## Aim :
- Understand the data ("A small step forward is better than a big one backwards")
- Begin to develop a modelling strategy
- Optimize the model with the best accuracy on the train set

## Target

price_range

## Features

* id
* battery_power : Total energy a battery can store in one time measured in mAh
* blue : Has bluetooth or not
* clock_speed : speed at which microprocessor executes instructions 
* dual_sim : Has dual sim support or not
* fc : Front Camera mega pixels
* four_g : Has 4G or not
* int_memory : Internal Memory in Gigabytes
* m_dep : Mobile Depth in cm
* mobile_wt : Weight of mobile phone
* n_cores : Number of cores of processor
* pc : Primary Camera mega pixels
* px_height : Pixel Resolution Height
* px_width : Pixel Resolution Width
* ram : Random Access Memory in Megabytes
* sc_h : Screen Height of mobile in cm
* sc_w : Screen Width of mobile in cm
* talk_time : longest time that a single battery charge will last when you are 
* three_g : Has 3G or not
* touch_screen : Has touch screen or not
* wifi : Has wifi or not

## Base Checklist
#### Shape Analysis :
- **target feature** : price_range
- **rows and columns (train set)** : 2000 , 21
- **rows and columns (test set)** : 1000 , 21
- **feature types** : qualitative : 0 , quantitative : 20
- **NaN analysis** :
    - NaN  : 0%

#### Columns Analysis :
- **Target Analysis** :
    - Balanced (Yes/No) : Yes
    - Percentages : 4 classes, repr. 25% of the dataset each (perfectly balanced)
- **Categorical values**
    - There are 6 binary categorical features (not including the target)
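
The checklist figures above can be recomputed programmatically. Here is a minimal sketch, assuming a DataFrame shaped like the train set. The helper `summarize` and the toy frame are hypothetical names used only for illustration and are not part of the notebook.

```python
import pandas as pd

def summarize(df, target):
    """Return basic checklist figures for a DataFrame (hypothetical helper)."""
    features = df.drop(columns=[target])
    return {
        "shape": df.shape,
        # Share of missing cells over the whole frame, in percent
        "nan_pct": float(df.isna().mean().mean() * 100),
        # Relative frequency of each target class
        "target_share": df[target].value_counts(normalize=True).sort_index().to_dict(),
        # Feature columns that take exactly two distinct values
        "binary_features": [c for c in features.columns if features[c].nunique() == 2],
    }

# Toy data standing in for train.csv (values are made up)
toy = pd.DataFrame({
    "blue": [0, 1, 0, 1],
    "ram": [512, 1024, 2048, 4096],
    "price_range": [0, 1, 2, 3],
})
summary = summarize(toy, "price_range")
print(summary)
```

On the real data the same call would be `summarize(df, 'price_range')` once `df` has been loaded.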

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
data_train = pd.read_csv('../input/mobile-price-classification/train.csv')
data_test = pd.read_csv('../input/mobile-price-classification/test.csv')
df = data_train.copy()
pd.set_option('display.max_rows',df.shape[0])
pd.set_option('display.max_columns',df.shape[1])
df.head()

In [None]:
df.dtypes.value_counts() # Count how many columns there are of each dtype

In [None]:
print('There are' , df.shape[0] , 'rows')
print('There are' , df.shape[1] , 'columns')

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(df.isna(),cbar=False)
plt.show()

## Examining target and features

In [None]:
df['price_range'].value_counts(normalize=True) # Class proportions (balanced: 25% each)

In [None]:
for col in df.select_dtypes(include=['float64','int64']):
    # displot is figure-level and creates its own figure, so no plt.figure() is needed
    sns.displot(df[col],kind='kde',height=3)
    plt.show()

# A bit of data engineering ...

In [None]:
for col in df.select_dtypes(include=['float64','int64']):
    print(f'{col :-<50} {df[col].unique()}')

In [None]:
def encoding(df):
    code = {
           }
    for col in df.select_dtypes('object'):
        df.loc[:,col]=df[col].map(code)
        
    return df

def imputation(df):
    
    #df = df.dropna(axis=0)
    df = df.fillna(df.mean())
    
    return df

def feature_engineering(df):
    useless_columns = []
    for feature in useless_columns:
        if feature in df:
            df = df.drop(feature,axis=1)
    return df


# No changes on the data sets to do here (just need to scale features)
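
Scaling matters here because the features live on very different ranges, for example `ram` in megabytes next to 0/1 flags. Here is a minimal sketch of what `StandardScaler` does, using made-up values rather than the real data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns on very different scales (e.g. ram in MB and a 0/1 flag)
X_toy = np.array([[512.0, 0.0],
                  [1024.0, 1.0],
                  [2048.0, 0.0],
                  [4096.0, 1.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

# After scaling, every column has mean 0 and standard deviation 1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```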

In [None]:
def preprocessing(df):
    df = encoding(df)
    df = feature_engineering(df)
    df = imputation(df)
    
    X = df.drop('price_range',axis=1)
    y = df['price_range'].astype(int)
      
    return df,X,y

In [None]:
df=data_train.copy()
df,X,y = preprocessing(df)
df.head()

In [None]:
Range_0 = df[y == 0]
Range_1 = df[y == 1]
Range_2 = df[y == 2]
Range_3 = df[y == 3]

# Detailed analysis

In [None]:
corr = df.corr(method='pearson').abs()

fig = plt.figure(figsize=(30,20))
sns.heatmap(corr, annot=True, cmap='tab10', vmin=0, vmax=+1)
plt.title('Pearson Correlation')
plt.show()

In [None]:
df.corr()['price_range'].abs().sort_values()

In [None]:
for col in df.columns:
    plt.figure(figsize=(4,4))
    plt.title(col)
    # histplot(kde=True) replaces distplot, which is deprecated in recent seaborn releases
    for label, subset in enumerate([Range_0, Range_1, Range_2, Range_3]):
        sns.histplot(subset[col], kde=True, stat='density', label=f"Range {label}")
    plt.legend()
    plt.show()

# Modelling

In [None]:
from sklearn.model_selection import train_test_split
df = data_train.copy()
trainset, valset = train_test_split(df, test_size=0.2, random_state=0)
print(trainset['price_range'].value_counts())
print(valset['price_range'].value_counts())
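
The split above is random, so the class counts in the two subsets can drift away from 25% each. One option, not used in this notebook, is to pass `stratify=` to `train_test_split`. Here is a sketch on made-up data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame: 4 balanced classes, 100 rows each
toy = pd.DataFrame({
    "ram": np.arange(400),
    "price_range": np.repeat([0, 1, 2, 3], 100),
})

train_part, val_part = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy["price_range"]
)

# Each class keeps the same share in both subsets
train_counts = train_part["price_range"].value_counts().sort_index().tolist()
val_counts = val_part["price_range"].value_counts().sort_index().tolist()
print(train_counts, val_counts)
```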

In [None]:
_, X_train, y_train = preprocessing(trainset)
_, X_val, y_val = preprocessing(valset)

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

In [None]:
preprocessor = make_pipeline(StandardScaler())

PCAPipeline = make_pipeline(preprocessor, PCA(n_components=2,random_state=0))

RandomPipeline = make_pipeline(preprocessor,RandomForestClassifier(random_state=0))
AdaPipeline = make_pipeline(preprocessor,AdaBoostClassifier(random_state=0))
SVMPipeline = make_pipeline(preprocessor,SVC(random_state=0,probability=True))
KNNPipeline = make_pipeline(preprocessor,KNeighborsClassifier())
LRPipeline = make_pipeline(preprocessor,LogisticRegression(solver='sag'))

## PCA Analysis

In [None]:
PCA_df = pd.DataFrame(PCAPipeline.fit_transform(X))
PCA_df = pd.concat([PCA_df, data_train['price_range']], axis=1)
PCA_df.head()

In [None]:
plt.figure(figsize=(8,8))
sns.scatterplot(x=PCA_df[0],y=PCA_df[1],hue=PCA_df['price_range'],palette=sns.color_palette("tab10", 4))
plt.show()
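
When reading a 2-component projection, it helps to know how much variance those two components retain. Here is a minimal sketch on synthetic data. `X_toy` is a stand-in, and the notebook's own `X` would take its place.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))  # synthetic stand-in for the feature matrix

pca_pipeline = make_pipeline(StandardScaler(), PCA(n_components=2, random_state=0))
projected = pca_pipeline.fit_transform(X_toy)

# Fraction of the total variance kept by each of the two components
ratios = pca_pipeline.named_steps["pca"].explained_variance_ratio_
print(projected.shape, ratios, ratios.sum())
```

A low total ratio would mean the scatter plot hides most of the structure in the data, so overlapping classes in the plot need not imply overlapping classes in the full feature space.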

## Classification problem

In [None]:
dict_of_models = {'RandomForest': RandomPipeline,
'AdaBoost': AdaPipeline,
'SVM': SVMPipeline,
'KNN': KNNPipeline,
'LR': LRPipeline}

In [None]:
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, classification_report, roc_curve
from sklearn.model_selection import learning_curve, cross_val_score, GridSearchCV

def evaluation(model):
    model.fit(X_train, y_train)
    # calculating the probabilities
    y_pred_proba = model.predict_proba(X_val)

    # finding the predicted valued
    y_pred = np.argmax(y_pred_proba,axis=1)
    print('Accuracy = ', accuracy_score(y_val, y_pred))
    print('-')
    print(confusion_matrix(y_val,y_pred))
    print('-')
    print(classification_report(y_val,y_pred))
    print('-')
    
    N, train_score, val_score = learning_curve(model, X_train, y_train, cv=4, scoring='accuracy', train_sizes=np.linspace(0.1,1,10))
    
    plt.figure(figsize=(8,6))
    plt.plot(N, train_score.mean(axis=1), label='train score')
    plt.plot(N, val_score.mean(axis=1), label='validation score')
    plt.legend()

In [None]:
for name, model in dict_of_models.items():
    print('---------------------------------')
    print(name)
    evaluation(model)

# Conclusion : 95.5% Accuracy reached using LogisticRegression

For the 5 models tested above, the validation accuracies are :
- LogisticRegression : 95.5%
- SVM : 90%
- RandomForest : 86%
- AdaBoost : 70%
- KNN : 51%

### The best model is LogisticRegression
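
The aim stated at the top includes optimizing the best model, and `GridSearchCV` is imported earlier in the notebook. Here is a sketch of what such a search could look like, run on synthetic data. The grid values are assumptions for illustration, not tuned settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class problem standing in for the mobile price data
X_toy, y_toy = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid over the regularization strength C
param_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=4, scoring="accuracy")
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

Keeping the scaler inside the pipeline means it is refit on each training fold, so the cross-validated score is not influenced by statistics from the held-out fold.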

In [None]:
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier

# Binarize the output
y_train = label_binarize(y_train, classes=[0, 1, 2, 3])
y_val = label_binarize(y_val, classes=[0, 1, 2, 3])
n_classes = y_train.shape[1]

# Learn to predict each class against the other
classifier = OneVsRestClassifier(LRPipeline)
y_score = classifier.fit(X_train, y_train).decision_function(X_val)

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_val[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_val.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

# # Plot of a ROC curve for a specific class
# plt.figure()
# plt.plot(fpr[2], tpr[2], label='ROC curve (area = %0.2f)' % roc_auc[2])
# plt.plot([0, 1], [0, 1], 'k--')
# plt.xlim([0.0, 1.0])
# plt.ylim([0.0, 1.05])
# plt.xlabel('False Positive Rate')
# plt.ylabel('True Positive Rate')
# plt.title('Receiver operating characteristic for class 2')
# plt.legend(loc="lower right")
# plt.show()

# Plot ROC curve
plt.figure()
plt.plot(fpr["micro"], tpr["micro"],
         label='micro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["micro"]))
for i in range(n_classes):
    plt.plot(fpr[i], tpr[i], label='ROC curve of class {0} (area = {1:0.2f})'
                                   ''.format(i, roc_auc[i]))

plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic to multi-class')
plt.legend(loc="lower right")
plt.show()
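
As a cross-check on the curves, scikit-learn's `roc_auc_score` can compute a one-vs-rest AUC directly from class probabilities. Here is a sketch on synthetic data. The numbers it prints are not the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 4-class problem standing in for the mobile price data
X_toy, y_toy = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)  # one column per class, rows sum to 1

# Macro-averaged one-vs-rest AUC over the four classes
auc_ovr = roc_auc_score(y_te, proba, multi_class="ovr", average="macro")
print(round(auc_ovr, 3))
```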

# Predictions

In [None]:
df_test=data_test.copy()
df_test = df_test.drop('id',axis=1)
df_test.head()

In [None]:
predicted_proba = classifier.predict_proba(df_test)
for i, price_range in enumerate(predicted_proba):
    print("Id :",i," - Price_range :",np.argmax(price_range),"with probability",round(max(price_range)*100),"%")

In [None]:
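# Pick the class with the highest one-vs-rest probability for each row.
# argmax over the 0/1 rows returned by predict() would fall back to class 0
# whenever no class is flagged for a sample.
predicted_range = np.argmax(classifier.predict_proba(df_test), axis=1).tolist()
print(predicted_range)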
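
One caveat when collapsing one-vs-rest outputs to a single label: for a classifier fitted on binarized targets, `predict` returns a 0/1 indicator row per sample, and such a row can be all zeros or contain several ones. `np.argmax` then silently returns the first matching index. Taking the argmax of the per-class scores avoids that ambiguity. Here is a small illustration with made-up arrays:

```python
import numpy as np

# Made-up one-vs-rest outputs for three samples and four classes
indicator = np.array([[0, 0, 1, 0],   # exactly one class flagged
                      [0, 0, 0, 0],   # no class flagged
                      [0, 1, 0, 1]])  # two classes flagged
scores = np.array([[0.1, 0.2, 0.6, 0.1],
                   [0.1, 0.2, 0.3, 0.4],
                   [0.0, 0.6, 0.1, 0.7]])

from_indicator = np.argmax(indicator, axis=1)
from_scores = np.argmax(scores, axis=1)

print(from_indicator)  # [2 0 1]
print(from_scores)     # [2 3 3]
```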

In [None]:
df_test['Price_range'] = predicted_range
df_test.head()
