# Mobile Price Range Classification

Hi! In this notebook I've used the C-Support Vector Classification (SVC) algorithm to classify mobile price ranges from the [mobile-price-classification](https://www.kaggle.com/iabhishekofficial/mobile-price-classification) dataset. I've mainly focused on model building, feature selection, hyperparameter optimization and K-fold cross-validation to obtain high accuracy with a single model.

1. Building a baseline model with SVC.
2. Selecting useful features and removing redundant ones.
3. Finding the right set of parameters using Optuna and 10-fold cross-validation.
4. Selecting the best trial, making predictions with the model from each fold and combining them across folds by taking the mode.

*Note*: This dataset is comparatively small. The fit time of `SVC` scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples. For large datasets consider using `LinearSVC` or `SGDClassifier` instead, possibly after a `Nystroem` transformer.
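As a rough illustration of the alternative mentioned in the note, here is a minimal sketch of a `Nystroem` + `LinearSVC` pipeline. It runs on synthetic data from `make_classification`, not on the mobile price data, and the values of `gamma` and `n_components` are arbitrary assumptions rather than tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

# Synthetic 4-class stand-in for a larger dataset (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    n_classes=4, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Scale, approximate an RBF kernel feature map, then fit a linear SVM on it
approx_svm = make_pipeline(
    MinMaxScaler(),
    Nystroem(kernel="rbf", gamma=1.0, n_components=100, random_state=0),
    LinearSVC(random_state=0),
)
approx_svm.fit(X_tr, y_tr)

approx_acc = approx_svm.score(X_te, y_te)
print("held-out accuracy:", approx_acc)
```

The linear model trains on 100 approximate kernel features instead of a full kernel matrix, which is what keeps the cost manageable as the number of samples grows.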

## **Useful imports and data loading**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import missingno as msno
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

warnings.filterwarnings('ignore')
%matplotlib inline

In [None]:
# Loading the data
df_train = pd.read_csv('/kaggle/input/mobile-price-classification/train.csv')
df_test = pd.read_csv('/kaggle/input/mobile-price-classification/test.csv')
df_train.head()

In [None]:
print('Train set shape:', df_train.shape)
print('Test set shape:', df_test.shape)

## **Basic Descriptive statistics and EDA**

In [None]:
df_train.info()

Let's look at some summary statistics of the train data.

In [None]:
df_train.describe().T

In [None]:
msno.bar(df_train)
plt.show()

There are no missing values in the data.

Let's take a look at the pairwise relationships between features.

In [None]:
sns.pairplot(data=df_train, hue='price_range')
plt.show()

In [None]:
sns.set(rc={'figure.figsize':(10,7)})

sns.stripplot(x="price_range", y="ram", data=df_train, dodge=True, palette='dark')
plt.show()

From the above plot we notice that the price range increases as RAM increases.

In [None]:
sns.swarmplot(x="fc", y="ram", hue="price_range", data=df_train, dodge=True, palette='deep')
plt.show()

In [None]:
sns.scatterplot(x="ram", y="battery_power", hue="price_range", data=df_train, palette='deep')
plt.show()

In [None]:
sns.countplot(x = 'price_range' , data = df_train)
plt.show()

In [None]:
df_train['price_range'].value_counts()

We can see from the above countplot that the price ranges are uniformly distributed.

Now let's see how our feature variables are correlated using a correlation matrix.

In [None]:
corr = df_train.corr()
sns.heatmap(corr, cmap="YlGnBu", linewidths=.5)
plt.show()

In [None]:
# Correlation of every column with the target, highest first
corr["price_range"].sort_values(ascending=False)

We notice that RAM has the highest correlation with the price range.

## **Creating train and validation sets for training**  

Let's split our train data into two parts, one used for training and the other for validation. We also want to scale the data before feeding it to the model. For this we will use the Min-Max scaler.

*Note: It is important to fit the scaler only on the train data and then use it to transform the train, validation and test data, to prevent any data leakage.*

In [None]:
x = df_train.drop(["price_range"],axis=1)
y = df_train["price_range"].values

x_train, x_val, y_train, y_val = train_test_split(x,y,test_size = 0.2,random_state=420)

min_max_scaling = preprocessing.MinMaxScaler()
x_train = min_max_scaling.fit_transform(x_train)
x_val = min_max_scaling.transform(x_val)
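The fit-on-train-only rule can also be enforced by wrapping the scaler and the model in a scikit-learn pipeline. The sketch below is an alternative to the manual steps above, not what this notebook does; it uses synthetic data, and the names `X_demo`, `scaled_svc` and `fitted_scaler` are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic 4-class data standing in for the real features (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_classes=4, random_state=0,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# fit() fits the scaler on X_tr only; score()/predict() reuse that fitted scaler
scaled_svc = make_pipeline(MinMaxScaler(), SVC(random_state=0))
scaled_svc.fit(X_tr, y_tr)

# The scaler's learned minimum matches the train split, not the full data
fitted_scaler = scaled_svc.named_steps["minmaxscaler"]
print(np.allclose(fitted_scaler.data_min_, X_tr.min(axis=0)))
print("val accuracy:", scaled_svc.score(X_va, y_va))
```

With a pipeline the validation data can never reach the scaler's `fit`, so the leak is ruled out by construction.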

## **Creating a baseline for our model**

Let's go ahead and fit a standard SVC to the train data.

In [None]:
from sklearn.svm import SVC
svm_model = SVC(random_state=420)
svm_model.fit(x_train,y_train)
print("train accuracy:",svm_model.score(x_train,y_train))
print("val accuracy:",svm_model.score(x_val,y_val))

With all the features selected we get a train accuracy of 0.97 and a validation accuracy of 0.85. Let's see if we can do better by keeping only the important features and dropping the less important ones.

## **Feature Selection and Fine tuning Model**

The scikit-learn library provides a bunch of functions we can use for selecting the best features based on univariate statistical tests.

+ For regression: f_regression, mutual_info_regression
+ For classification: chi2, f_classif, mutual_info_classif

ANOVA, short for "analysis of variance", is a parametric statistical hypothesis test for determining whether the means of two or more samples of data (often three or more) come from the same distribution.

An F-test is a class of statistical tests that compute a ratio of variances, such as the variances of two different samples, or the variance explained and unexplained by a statistical test, as in ANOVA.

The ANOVA F-test is the F-test used in ANOVA. It applies when one variable is numeric and the other is categorical, such as numerical input variables and a categorical target variable in a classification task. The results of this test can be used for feature selection: features that appear independent of the target variable can be removed from the dataset.
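To make the variance ratio concrete, here is a small sketch that computes the one-way ANOVA F-statistic by hand for a single feature and compares it with `f_classif`. The toy data (three classes with shifted means) is invented for the example.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)

# Toy data: one numeric feature, three classes of 30 samples with shifted means
y_toy = np.repeat([0, 1, 2], 30)
x_toy = rng.normal(loc=y_toy * 0.5, scale=1.0)

groups = [x_toy[y_toy == c] for c in np.unique(y_toy)]
n_total, n_groups = x_toy.size, len(groups)
grand_mean = x_toy.mean()

# Between-group and within-group sums of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = explained variance / unexplained variance, each divided by its degrees of freedom
f_manual = (ss_between / (n_groups - 1)) / (ss_within / (n_total - n_groups))

f_sklearn, p_sklearn = f_classif(x_toy.reshape(-1, 1), y_toy)
print(f_manual, f_sklearn[0])
```

A large F-value means the class means differ a lot relative to the spread inside each class, which is why higher scores mark more useful features.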


`SelectKBest` removes all but the *k* highest scoring features.  
`f_classif` computes the ANOVA F-value for the provided sample.

We can configure `SelectKBest` to use the `f_classif()` function so that features are ranked by their `selector.scores_` values (higher is better). We then plot the accuracy score while adding one feature at a time to decide how many top features we need for a better score.

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

train_accuracy = []

k = np.arange(1,21,1)

for i in k:
    selector = SelectKBest(f_classif, k=i)
    x_train_new = selector.fit_transform(x_train, y_train)
    svm_model.fit(x_train_new,y_train)
    train_accuracy.append(svm_model.score(x_train_new,y_train))
    
plt.plot(k,train_accuracy,color="blue",label="train")
plt.xlabel("k values")
plt.ylabel("train accuracy")
plt.legend()
plt.show()

From the graph we notice that the train accuracy is highest when we fit the model with around 12-14 features. Let's find the exact number of features by inspecting the results.

In [None]:
result = pd.DataFrame(data= {'k_best_features': k,
                             'train_accuracy': train_accuracy})
result

In [None]:
print(result[result.train_accuracy == result.train_accuracy.max()])

The model reaches its best train accuracy of 0.98 with the top 13 features selected, which is higher than the 0.97 obtained earlier with all features. Now let's look at which features those are.

In [None]:
selector = SelectKBest(f_classif, k = 13)

x_train_new = selector.fit_transform(x_train, y_train)
x_val_new = selector.transform(x_val)

top_features = x.columns.values[selector.get_support()].tolist()
print("Top features:",top_features)

Now let's plug those features into the model and check both the train and validation accuracy.

In [None]:
svm_model = SVC(random_state=420)
svm_model.fit(x_train_new,y_train)
print("train accuracy:",svm_model.score(x_train_new,y_train))
print("val accuracy:",svm_model.score(x_val_new,y_val))

We notice the validation accuracy has also improved, from 0.85 to 0.89, which is good.
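The number of features above was chosen from train accuracy. As a complementary check, *k* could also be chosen by cross-validated accuracy, with scaling and selection refit inside each fold. The sketch below shows one way to do that on synthetic data; it is not the procedure used in this notebook, and the names `cv_scores` and `best_k` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic 4-class data with 20 features, only some of them informative
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=6,
    n_classes=4, random_state=0,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = {}

for n_feat in range(1, X_demo.shape[1] + 1):
    # Scaler and selector are refit on each training fold inside cross_val_score
    pipe = Pipeline([
        ("scale", MinMaxScaler()),
        ("select", SelectKBest(f_classif, k=n_feat)),
        ("svc", SVC(random_state=0)),
    ])
    cv_scores[n_feat] = cross_val_score(pipe, X_demo, y_demo, cv=cv).mean()

best_k = max(cv_scores, key=cv_scores.get)
print("best k:", best_k, "mean CV accuracy:", round(cv_scores[best_k], 3))
```

Because every score comes from held-out folds, the chosen *k* reflects generalization rather than how well the model memorizes the training rows.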

## **Hyperparameter tuning with the selected top features using Optuna**

For more details: https://optuna.readthedocs.io/en/stable/index.html

We now know which features to select for our model. It's time to improve the model by tuning its hyperparameters. To do so we will use 10-fold cross-validation and compute the mean accuracy.

**TWO THINGS TO SPECIFY** : number of trials (20) and number of folds (10) for each trial.

**STEPS FROM HERE:**
1. For each trial and fold, we save both the scaler that transforms the data and the model used to calculate accuracy.

2. After the trials finish, we select the *best trial*, i.e. the one that gives the best mean accuracy on the out-of-fold predictions.

3. Next, we transform the test data with the saved scalers and get predictions from the models saved from the best trial.

4. Finally, we use the mode to select the most frequent category among the predictions.
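The steps above can be sketched compactly for a single fixed set of parameters, without Optuna or pickle files. This is only an illustration on synthetic data: it keeps the fitted scaler and model pairs in memory through `cross_validate(..., return_estimator=True)`, whereas the notebook saves them to disk per trial and fold. The parameter values and variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic 4-class data; X_new plays the role of the unlabeled test set
X_demo, y_demo = make_classification(
    n_samples=600, n_features=13, n_informative=6,
    n_classes=4, random_state=0,
)
X_fit, X_new, y_fit, _ = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Step 1: one scaler + model pair is fitted and kept for each of the 10 folds
fold_pipeline = make_pipeline(MinMaxScaler(), SVC(kernel="rbf", C=1.0))
cv_out = cross_validate(
    fold_pipeline, X_fit, y_fit,
    cv=StratifiedKFold(n_splits=10), return_estimator=True,
)

# Step 2: this mean is the quantity a trial would report
print("mean fold accuracy:", cv_out["test_score"].mean())

# Step 3: each fitted pipeline scales and predicts the new data
fold_preds = np.array([est.predict(X_new) for est in cv_out["estimator"]])

# Step 4: majority vote per sample (ties go to the smallest label)
voted = np.apply_along_axis(lambda votes: np.bincount(votes).argmax(), 0, fold_preds)
print(fold_preds.shape, voted.shape)
```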

In [None]:
# Preparing the train data

x = np.array(df_train[top_features])
y = np.array(df_train["price_range"])

In [None]:
# Hyperparameter tuning and saving the models

import optuna
import pickle

def objective(trial):
    
    params = {
        # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
        'C':trial.suggest_float('C', 1e-10, 1e10, log=True),
        'kernel':trial.suggest_categorical('kernel',["linear","rbf"]),
        'gamma':trial.suggest_categorical('gamma',["auto","scale"]),
        'decision_function_shape':trial.suggest_categorical(
            'decision_function_shape',["ovo","ovr"]
        )
    }
    
    svm_model = SVC(**params)
    
    skf = StratifiedKFold(n_splits=10)
    accuracy = []
    
    for fold, (train_index, val_index) in enumerate(skf.split(x,y)):
        x_train, y_train = x[train_index], y[train_index]
        x_val, y_val = x[val_index], y[val_index]
        
        scaler = preprocessing.MinMaxScaler()
        scaler.fit(x_train)
        x_train = scaler.transform(x_train)
        x_val = scaler.transform(x_val)
        
        # Save the fitted scaler; the with-block closes the file handle
        SCALER_PATH = f'scaler-t{trial.number}-f{fold}.pickle'
        with open(SCALER_PATH, 'wb') as f:
            pickle.dump(scaler, f)
        
        svm_model.fit(x_train,y_train)
        score = svm_model.score(x_val,y_val)
        accuracy.append(score)
        
        # Save the model fitted on this fold; the with-block closes the file handle
        MODEL_PATH = f'model-t{trial.number}-f{fold}.pickle'
        with open(MODEL_PATH, 'wb') as f:
            pickle.dump(svm_model, f)

    print(f'Trial done: Accuracy scores values on folds: {accuracy}')
    accuracy_on_folds = np.mean(accuracy)
    
    return accuracy_on_folds

In [None]:
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=20)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)

Now that training is finished, let's look at the results and see which trial gave us the best mean accuracy and what its parameter values are.

We have also saved the model states and transformation states for each trial and fold during training.

In [None]:
ls

After finding the trial that gives the best mean accuracy, we will use the models and transformations from that trial to predict the price category for the test data.

Note: There are 10 different transformation states and 10 model states which we have saved from the best trial.

In [None]:
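num_folds = 10
best_trial_number = study.best_trial.number

# The raw test features are the same for every fold, so select them once
x_test_raw = np.array(df_test[top_features])

predictions_from_folds = []

for i in range(num_folds):

    # Load the scaler fitted on this fold's training split and transform the test data
    SCALER_FILE = f'scaler-t{best_trial_number}-f{i}.pickle'
    with open(SCALER_FILE, 'rb') as f:
        scaler = pickle.load(f)
    x_test = scaler.transform(x_test_raw)

    # Load the model fitted on the same fold and predict
    MODEL_FILE = f'model-t{best_trial_number}-f{i}.pickle'
    with open(MODEL_FILE, 'rb') as f:
        model = pickle.load(f)
    y_test_preds = model.predict(x_test)
    predictions_from_folds.append(y_test_preds)

predictions_from_folds = np.array(predictions_from_folds)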

We now have predictions from 10 different models. Let's see what they look like: each column is a test sample and each row is one of the 10 models.

In [None]:
predictions_df = pd.DataFrame(predictions_from_folds)
predictions_df

We will now calculate mode for each item in the test data as the final prediction.

In [None]:
final_predictions = predictions_df.mode(axis=0)
final_predictions

After computing the mode with pandas, we see that the output has two rows and that most of the values in the second row are NaN. The second row holds the second mode where one exists, i.e. when two values have equal counts. So let's find out if we have any.

In [None]:
final_predictions.isnull().sum(axis=1)
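The tie behaviour described above is easier to see on a toy table. In this sketch four hypothetical models vote on three samples; the values are invented so that only the first sample ends in a tie.

```python
import pandas as pd

# Rows are models, columns are test samples (made-up predictions)
toy_preds = pd.DataFrame(
    [[0, 1, 3],
     [0, 1, 3],
     [1, 1, 2],
     [1, 2, 3]],
    index=["model_0", "model_1", "model_2", "model_3"],
    columns=["sample_0", "sample_1", "sample_2"],
)

# One row per mode rank; columns without a tie get NaN in the second row
toy_modes = toy_preds.mode(axis=0)
print(toy_modes)

# Samples with a non-NaN second row had two equally frequent predictions
tied = toy_modes.columns[toy_modes.iloc[1].notna()].tolist()
print("tied samples:", tied)

# The first row always exists, so it can serve as the final prediction
toy_final = toy_modes.iloc[0].astype(int).tolist()
print("final predictions:", toy_final)
```

Taking the first row resolves a tie in favour of the smaller label, since pandas returns the modes in sorted order.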

Finally, we keep the first mode for each test sample and save the prediction results to a CSV file.

In [None]:
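# Keep only the first mode for each test sample. Selecting the row by position
# also works when there are no ties and the second row does not exist.
results = final_predictions.iloc[[0]].T.astype(int)
results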

In [None]:
results.to_csv('results.csv', header=False, index=False)

References:

https://scikit-learn.org/stable/modules/feature_selection.html  
https://machinelearningmastery.com/