## Introduction

In this notebook, we will look at one of the most important steps in building a machine learning model: **Feature Selection**. Which features we select from the dataset strongly affects how accurate the model is, both during training and on test data.

We will use a statistical test (Chi-square) and tree-based feature importance to find the features most strongly related to the target, so that we keep only those likely to help the model.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

## Load and Prepare the Dataset

In [1]:
# Load the data

data = pd.read_csv('../input/mobile-price-classification/train.csv')
data.head()

In [1]:
# Split into features and target variables

X = data.iloc[:, :-1]
y = data.iloc[:, -1]
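
Before scoring, it can help to confirm that no feature contains negative values, because scikit-learn's `chi2` raises an error on negative input. The sketch below is only an illustration: the helper `negative_columns` is hypothetical, and the toy DataFrame uses invented values (including a made-up `centered_weight` column) rather than the real dataset.

```python
import pandas as pd

def negative_columns(features: pd.DataFrame) -> list:
    """Return the names of columns that contain at least one negative value."""
    has_negative = (features < 0).any(axis=0)
    return has_negative[has_negative].index.tolist()

# Invented toy values; in the notebook you would pass X instead
toy = pd.DataFrame({
    'ram': [512, 2048, 4096],
    'battery_power': [800, 1500, 1900],
    'centered_weight': [-12.5, 0.0, 12.5],  # e.g. a mean-centered column
})

bad = negative_columns(toy)
print(bad)
```

An empty list means the features can be passed to `chi2` as they are; otherwise the listed columns would need to be shifted or rescaled to a non-negative range first.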

## Compare and Select N-best features using SelectKBest

In [1]:
# Call the constructor to select 10 best features according to their Chi-square tests against the target variable
best_features = SelectKBest(score_func=chi2, k = 10)

fit = best_features.fit(X, y)
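
Once fitted, the selector can report which columns it kept through `get_support()`, and `transform()` returns only those columns. Here is a minimal sketch on synthetic count data; the column names `f0` to `f5` and the data itself are made up for illustration and are not part of the mobile price dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 300
target = rng.integers(0, 3, size=n)

# Synthetic non-negative features: f0 and f1 depend on the target, the rest are noise
toy_X = pd.DataFrame({
    'f0': rng.poisson(5 + 10 * target),
    'f1': rng.poisson(20 - 5 * target),
    'f2': rng.poisson(10, size=n),
    'f3': rng.poisson(10, size=n),
    'f4': rng.poisson(10, size=n),
    'f5': rng.poisson(10, size=n),
})

selector = SelectKBest(score_func=chi2, k=2).fit(toy_X, target)

# Boolean mask over the input columns, True for the k columns that were kept
selected = toy_X.columns[selector.get_support()].tolist()
print(selected)

# transform() drops every column that was not selected
reduced = selector.transform(toy_X)
print(reduced.shape)
```

In the notebook, the same two calls on `best_features` would list the 10 selected column names and produce the reduced feature matrix.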

In [1]:
# Pair each feature name with its Chi-square score
featureScores = pd.DataFrame({'Feature': X.columns, 'Score': fit.scores_})

# Show the features from highest to lowest score
featureScores.sort_values(by='Score', ascending=False)

As the DataFrame above shows, and as we might expect from experience, RAM is the most significant feature for determining a smartphone's price range. Other strong indicators are the dimensions of the phone, battery power, and internal memory.
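
When reading these scores, it may help to keep in mind that scikit-learn's `chi2` is designed for non-negative values such as counts or frequencies, so a feature's score grows with its numeric scale. The sketch below illustrates this on a single synthetic feature (the data is invented): multiplying the feature by 1000 multiplies its score by the same factor, even though its relationship with the target is unchanged.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)

# One non-negative synthetic feature that is related to the label
values = rng.integers(0, 10, size=200) + 3 * labels
X_small = values.reshape(-1, 1).astype(float)
X_large = X_small * 1000  # same information, different unit

score_small, _ = chi2(X_small, labels)
score_large, _ = chi2(X_large, labels)

# Ratio between the score of the rescaled feature and the original one
ratio = score_large[0] / score_small[0]
print(ratio)
```

For this reason, features measured in large units (for example RAM in megabytes) can rank higher partly because of their scale, so comparing the Chi-square ranking with a second method is a reasonable sanity check.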

## Feature Importance

Another technique we will use is feature importance. Here, we fit a tree-based model and let it assign an importance score to each feature in the dataset.

In [1]:
# Initialize a RandomForestClassifier model and fit it to our dataset
# (random_state is fixed so the importance scores are reproducible)
model = RandomForestClassifier(random_state=42)
model.fit(X, y)

In [1]:
# From this model, we get importance scores for each feature
model.feature_importances_

In [1]:
featureImp = pd.DataFrame(model.feature_importances_, index = X.columns, columns = ['Importance'])
featureImp = featureImp.sort_values(by = 'Importance', ascending = False)
featureImp

Again, the model gives a similar result: RAM is the most important feature for determining the price range of a mobile phone.

In [1]:
# Let's plot the importance score for each feature
plt.figure(figsize=(10, 5))
plt.bar(featureImp.index, featureImp['Importance'])
plt.xticks(rotation=90)  # rotate the feature names so they don't overlap
plt.ylabel('Importance')
plt.show()

From the above plot, we can see that after the first few features the importance scores drop off sharply. Selecting the top 4-5 features should therefore be enough for a reasonably accurate model.
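
One way to check a claim like this is to compare cross-validated accuracy with all features against accuracy with only the top few. The sketch below does so on synthetic data from `make_classification`, used here as a stand-in for the real dataset; the choice of `k = 5`, 100 trees, and 5 folds are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the notebook's data: 20 features, only 4 informative
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=4, n_redundant=0,
    n_classes=2, random_state=0,
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Indices of the k features with the highest importance
k = 5
top_k = np.argsort(forest.feature_importances_)[::-1][:k]

# 5-fold cross-validated accuracy with all features vs. only the top k
score_all = cross_val_score(forest, X_demo, y_demo, cv=5).mean()
score_top = cross_val_score(forest, X_demo[:, top_k], y_demo, cv=5).mean()

print(f"all features:   {score_all:.3f}")
print(f"top {k} features: {score_top:.3f}")
```

Note that the top features here are chosen using the full dataset before cross-validation, which can make the second score slightly optimistic. A stricter comparison would perform the selection inside each fold, for example with a scikit-learn `Pipeline`.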