# **Discriminative Feature Selection**

# FEATURE SELECTION

Feature selection is the process of automatically or manually selecting the features that contribute most to the prediction variable or output you are interested in. Irrelevant features in your data can reduce model accuracy, because the model ends up learning from them.

We will work through it with a practical example. The steps are as follows:

- Import important libraries

- Importing data

- Data Preprocessing

    - Price

    - Size

    - Installs

- Discriminative Feature Check

    - Reviews

    - Price

**1. Import Important Libraries**

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

**2. Importing Data**

Today we will be working on a Google Play Store apps dataset with ratings. Link to the dataset: https://www.kaggle.com/lava18/google-play-store-apps/data

In [6]:
df = pd.read_csv('googleplaystore.csv',encoding='unicode_escape')
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hid...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


**3. Data Preprocessing**

Let us have a look at all the datatypes first :

In [7]:
df.dtypes

App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

We see that every column except 'Rating' has the object datatype. Columns such as 'Reviews', 'Size', 'Installs' and 'Price' hold numbers, so we want them in numeric form; as objects they cannot be compared or summarised numerically. Let us start with the 'Price' column.

**i) Price** 

In the head of the dataset we only saw 0 values in the 'Price' column. Let us look at the rows with non-zero prices. As the 'Price' column is of object type, we compare it with the string '0' instead of the number 0.

In [8]:
df[df['Price']!='0'].head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
234,TurboScan: scan documents and receipts in PDF,BUSINESS,4.7,11442,6.8M,"100,000+",Paid,$4.99,Everyone,Business,"March 25, 2018",1.5.2,4.0 and up
235,Tiny Scanner Pro: PDF Doc Scan,BUSINESS,4.8,10295,39M,"100,000+",Paid,$4.99,Everyone,Business,"April 11, 2017",3.4.6,3.0 and up
290,TurboScan: scan documents and receipts in PDF,BUSINESS,4.7,11442,6.8M,"100,000+",Paid,$4.99,Everyone,Business,"March 25, 2018",1.5.2,4.0 and up
291,Tiny Scanner Pro: PDF Doc Scan,BUSINESS,4.8,10295,39M,"100,000+",Paid,$4.99,Everyone,Business,"April 11, 2017",3.4.6,3.0 and up
427,Puffin Browser Pro,COMMUNICATION,4.0,18247,Varies with device,"100,000+",Paid,$3.99,Everyone,Communication,"July 5, 2018",7.5.3.20547,4.1 and up


We see that the 'Price' column has a dollar sign at the beginning for apps that are not free, so we cannot convert it to a numeric type directly. We first have to remove the $ sign so that all values have a uniform format.

We use the replace function to replace the dollar sign with an empty string. Notice that we go through the .str accessor, as this replace works on the string values of the column, and that we pass regex=False so that '$' is treated as a literal character rather than as a regular-expression anchor.

In [10]:
df['Price'] = df['Price'].str.replace('$','', regex=False)
df[df['Price']!='0'].head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
234,TurboScan: scan documents and receipts in PDF,BUSINESS,4.7,11442,6.8M,"100,000+",Paid,4.99,Everyone,Business,"March 25, 2018",1.5.2,4.0 and up
235,Tiny Scanner Pro: PDF Doc Scan,BUSINESS,4.8,10295,39M,"100,000+",Paid,4.99,Everyone,Business,"April 11, 2017",3.4.6,3.0 and up
290,TurboScan: scan documents and receipts in PDF,BUSINESS,4.7,11442,6.8M,"100,000+",Paid,4.99,Everyone,Business,"March 25, 2018",1.5.2,4.0 and up
291,Tiny Scanner Pro: PDF Doc Scan,BUSINESS,4.8,10295,39M,"100,000+",Paid,4.99,Everyone,Business,"April 11, 2017",3.4.6,3.0 and up
427,Puffin Browser Pro,COMMUNICATION,4.0,18247,Varies with device,"100,000+",Paid,3.99,Everyone,Communication,"July 5, 2018",7.5.3.20547,4.1 and up
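Why regex=False matters here: in a regular expression, `$` is an anchor for the end of the string, not a literal dollar sign. A minimal sketch on a made-up Series (the sample values are only illustrative) showing two spellings that both remove the literal character:

```python
import pandas as pd

# Illustrative prices in the same format as the 'Price' column
prices = pd.Series(['$4.99', '0', '$3.99'])

# Literal replacement: '$' is treated as a plain character
literal = prices.str.replace('$', '', regex=False)

# Equivalent regex form: the dollar sign has to be escaped
escaped = prices.str.replace(r'\$', '', regex=True)

print(literal.tolist())
```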


**ii) Size**

Looking at the 'Size' column, we see that the values end with the letter 'M' for megabytes. To use the size as a number, we need to remove that letter.

For this, we access the column as strings, drop the last character of each value and save the result back in the 'Size' column. Be aware that this slice removes whatever the last character is, so a value with a different unit suffix (for example 'k' for kilobytes) would lose its unit and be read on the same scale as the megabyte values.

Notice from the earlier output that the 'Size' for row 427 is given as 'Varies with device'. We obviously cannot convert such data to numeric. We will see how to deal with it later.
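The outputs from this point on list only six columns (App, Rating, Reviews, Size, Installs, Price), so the notebook presumably dropped the other columns in a cell that is not shown. A minimal sketch of such a step, assuming a plain column selection; the `keep` list and the one-row `sample` frame are illustrative and not taken from the original notebook:

```python
import pandas as pd

# Columns that appear in the later outputs
keep = ['App', 'Rating', 'Reviews', 'Size', 'Installs', 'Price']

# One illustrative row with a few of the extra columns
sample = pd.DataFrame({
    'App': ['Sketch - Draw & Paint'],
    'Category': ['ART_AND_DESIGN'],
    'Rating': [4.5],
    'Reviews': ['215644'],
    'Size': ['25M'],
    'Installs': ['50,000,000+'],
    'Type': ['Free'],
    'Price': ['0'],
})

# .copy() avoids chained-assignment warnings on later column updates
subset = sample[keep].copy()
print(subset.columns.tolist())
```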

In [0]:
df['Size'] = df['Size'].str[:-1]
df.head()

Unnamed: 0,App,Rating,Reviews,Size,Installs,Price
0,Photo Editor & Candy Camera & Grid & ScrapBook,4.1,159,19.0,"10,000+",0
1,Coloring book moana,3.9,967,14.0,"500,000+",0
2,"U Launcher Lite – FREE Live Cool Themes, ...",4.7,87510,8.7,"5,000,000+",0
3,Sketch - Draw & Paint,4.5,215644,25.0,"50,000,000+",0
4,Pixel Draw - Number Art Coloring Book,4.3,967,2.8,"100,000+",0
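Slicing off the last character discards the unit. If the column also holds sizes with another suffix, a unit-aware conversion keeps everything on one scale. A hedged sketch, assuming a trailing 'k' means kilobytes and 1024 kilobytes per megabyte; `size_to_megabytes` is a hypothetical helper, not part of the original notebook:

```python
import pandas as pd

def size_to_megabytes(size: pd.Series) -> pd.Series:
    """Convert strings like '19M' or '994k' to megabytes; anything else becomes NaN."""
    number = pd.to_numeric(size.str[:-1], errors='coerce')
    # Assumed unit factors: 'M' = megabytes, 'k' = kilobytes
    factor = size.str[-1].map({'M': 1.0, 'k': 1 / 1024})
    return number * factor

sizes = pd.Series(['19M', '8.7M', '994k', 'Varies with device'])
converted = size_to_megabytes(sizes)
print(converted)
```

Values with an unknown suffix, such as 'Varies with device', end up as NaN, which matches how the tutorial treats them later.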


**iii) Installs**

In the 'Installs' column, there are two changes we need to make before converting it to numeric: remove the '+' sign from the end of each value and remove the commas.

To remove the last character, we apply the same procedure as for the 'Size' column :

In [0]:
df['Installs'] = df['Installs'].str[:-1]
df.head()

Unnamed: 0,App,Rating,Reviews,Size,Installs,Price
0,Photo Editor & Candy Camera & Grid & ScrapBook,4.1,159,19.0,"10,000",0
1,Coloring book moana,3.9,967,14.0,"500,000",0
2,"U Launcher Lite – FREE Live Cool Themes, ...",4.7,87510,8.7,"5,000,000",0
3,Sketch - Draw & Paint,4.5,215644,25.0,"50,000,000",0
4,Pixel Draw - Number Art Coloring Book,4.3,967,2.8,"100,000",0


To remove the commas, we use the replace function to replace each comma with an empty string.

The replace function only works on strings, hence we access the values of the series through .str before applying it :

In [0]:
df['Installs'] = df['Installs'].str.replace(',','', regex=False)
df.head()

Unnamed: 0,App,Rating,Reviews,Size,Installs,Price
0,Photo Editor & Candy Camera & Grid & ScrapBook,4.1,159,19.0,10000,0
1,Coloring book moana,3.9,967,14.0,500000,0
2,"U Launcher Lite – FREE Live Cool Themes, ...",4.7,87510,8.7,5000000,0
3,Sketch - Draw & Paint,4.5,215644,25.0,50000000,0
4,Pixel Draw - Number Art Coloring Book,4.3,967,2.8,100000,0
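Both steps can also be chained. The sketch below uses `str.rstrip('+')` instead of slicing, which only removes a trailing '+' and leaves values without one untouched; the sample values are illustrative:

```python
import pandas as pd

installs = pd.Series(['10,000+', '5,000,000+', '100'])

cleaned = (
    installs
    .str.rstrip('+')                     # drop a trailing '+' if present
    .str.replace(',', '', regex=False)   # drop thousands separators
)
numeric = pd.to_numeric(cleaned, errors='coerce')
print(numeric.tolist())
```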


Now we finally convert the columns to numeric type using the to_numeric function. Notice that we pass errors='coerce'. With this parameter, every value that cannot be parsed as a number becomes NaN. For example, the 'Size' in row 427 ('Varies with device') cannot be parsed, so it becomes NaN. After that we take a look at the datatypes of the columns again.

In [0]:
df['Reviews'] = pd.to_numeric(df['Reviews'],errors='coerce')
df['Size'] = pd.to_numeric(df['Size'],errors='coerce')
df['Installs'] = pd.to_numeric(df['Installs'],errors='coerce')
df['Price'] = pd.to_numeric(df['Price'],errors='coerce')
df.dtypes

App          object
Rating      float64
Reviews     float64
Size        float64
Installs    float64
Price       float64
dtype: object
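To see what errors='coerce' does in isolation, here is a small sketch on made-up values: strings that parse become floats, and everything else becomes NaN instead of raising an error.

```python
import pandas as pd

raw = pd.Series(['159', '8.7', 'Varies with devic', '3.0M'])

# Unparseable entries turn into NaN rather than raising ValueError
parsed = pd.to_numeric(raw, errors='coerce')
print(parsed)

# Count how many values were lost in the conversion
n_failed = int(parsed.isna().sum() - raw.isna().sum())
print(n_failed)
```

Comparing the NaN count before and after the conversion is a quick way to check how much data the coercion discarded.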

Now we will see and work with all the NaN values. Let us first have a look at all the NaN values in the dataset :

In [0]:
df.isna().sum()

App            0
Rating      1474
Reviews        1
Size        1696
Installs       2
Price          1
dtype: int64

As rating is the output of our dataset, we cannot have that to be NaN. Hence we will remove all the rows with 'Rating' as NaN :

In [0]:
df = df[df['Rating'].notna()]
df.isna().sum()

App            0
Rating         0
Reviews        1
Size        1638
Installs       1
Price          1
dtype: int64
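The same filtering can be written with `dropna`, which names the column whose missing values should remove a row. A small sketch on illustrative data:

```python
import numpy as np
import pandas as pd

apps = pd.DataFrame({
    'App': ['a', 'b', 'c'],
    'Rating': [4.1, np.nan, 3.9],
    'Size': [19.0, 14.0, np.nan],
})

# Drop rows only where 'Rating' is missing; NaN in 'Size' is kept
rated = apps.dropna(subset=['Rating'])
print(rated)
```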

This is the final preprocessed dataset that we obtained :

In [0]:
df.head()

Unnamed: 0,App,Rating,Reviews,Size,Installs,Price
0,Photo Editor & Candy Camera & Grid & ScrapBook,4.1,159.0,19.0,10000.0,0.0
1,Coloring book moana,3.9,967.0,14.0,500000.0,0.0
2,"U Launcher Lite – FREE Live Cool Themes, ...",4.7,87510.0,8.7,5000000.0,0.0
3,Sketch - Draw & Paint,4.5,215644.0,25.0,50000000.0,0.0
4,Pixel Draw - Number Art Coloring Book,4.3,967.0,2.8,100000.0,0.0


**4. Discriminative Feature Check**

Now we move on to the discriminative feature check, to see which features separate good ratings from bad ones and which do not. We will start with the 'Reviews' column. For our case, we take Rating > 4.3 as a good rating. We choose that value because, as the following stats show, 4.3 is the median (50th percentile) rating, so it splits the apps roughly in half.

Before we do that, let us have a look at the statistics of the whole table :

In [0]:
df.describe()

Unnamed: 0,Rating,Reviews,Size,Installs,Price
count,9367.0,9366.0,7729.0,9366.0,9366.0
mean,4.193338,514049.8,37.284513,17897440.0,0.960928
std,0.537431,3144042.0,93.509493,91238220.0,15.816585
min,1.0,1.0,1.0,1.0,0.0
25%,4.0,186.25,6.1,10000.0,0.0
50%,4.3,5930.5,16.0,500000.0,0.0
75%,4.5,81532.75,37.0,5000000.0,0.0
max,19.0,78158310.0,994.0,1000000000.0,400.0
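The statistics above report a maximum 'Rating' of 19.0. Assuming ratings are meant to lie on a 1 to 5 scale, that value points to at least one malformed row. A sketch of how such rows could be inspected and filtered; the bounds are an assumption and the sample frame is illustrative:

```python
import pandas as pd

apps = pd.DataFrame({
    'App': ['a', 'b', 'c'],
    'Rating': [4.1, 19.0, 3.9],
})

# Assumed valid range for a store rating
in_range = apps['Rating'].between(1, 5)

suspicious = apps[~in_range]   # rows to inspect by hand
clean = apps[in_range]
print(suspicious)
```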


**i) Reviews**

We have to try several threshold values to see which one separates the ratings best. We will start with the mean of the 'Reviews' column, which is roughly 514,000 (the cell below uses 514098).

We will use a new function over here known as crosstab. Crosstab allows us to have a frequency count across 2 columns or conditions.

We could also normalize the column results to obtain the conditional probability of P(Rating = HIGH | condition)

We have also turned on the margins to see the total frequency under that condition.

In [0]:
pd.crosstab(df['Rating']>4.3,df['Reviews']>514098,rownames=['Ratings>4.3'],colnames=['Reviews>514098'],margins= True)

Reviews>514098,False,True,All
Ratings>4.3,,,
False,4967,335,5302
True,3383,682,4065
All,8350,1017,9367
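To make the crosstab output easier to read, here is a toy version with six made-up apps and a review threshold of 500 (both chosen only for illustration). Each cell counts the apps that fall into that combination of the two conditions, and margins=True adds the 'All' totals:

```python
import pandas as pd

toy = pd.DataFrame({
    'Rating':  [4.5, 4.0, 4.7, 3.9, 4.4, 4.1],
    'Reviews': [100, 200, 900,  50, 800, 700],
})

table = pd.crosstab(
    toy['Rating'] > 4.3,
    toy['Reviews'] > 500,
    rownames=['Ratings>4.3'],
    colnames=['Reviews>500'],
    margins=True,
)
print(table)
```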


We see that the number of apps with Reviews > 514098 is very small: 1017 of 9367, or about 11%.

Hence it is better to use the 50th percentile rather than the mean as the pivot point. The 50th percentile is about 5930 reviews in this case, so let us take a look at that :

In [0]:
pd.crosstab(df['Rating']>4.3,df['Reviews']>5930,rownames=['Ratings>4.3'],colnames=['Reviews>5930'],margins= True)

Reviews>5930,False,True,All
Ratings>4.3,,,
False,2906,2396,5302
True,1778,2287,4065
All,4684,4683,9367


Now the apps are split almost equally between the low-review and high-review groups (4684 and 4683). So from now on we will start from the 50th percentile. Let us now look at the conditional probability :

In [0]:
pd.crosstab(df['Rating']>4.3,df['Reviews']>5930,rownames=['Ratings>4.3'],colnames=['Reviews>5930'],margins= True,normalize='columns')

Reviews>5930,False,True,All
Ratings>4.3,,,
False,0.62041,0.511638,0.56603
True,0.37959,0.488362,0.43397


There is not much difference between P(Ratings=HIGH|Reviews<=5930) = 0.38 and P(Ratings=HIGH|Reviews>5930) = 0.49, so this is a bad feature.
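The normalized table can be reproduced by hand from the counts in the earlier crosstab: each cell is divided by its column total. A short check using the reported counts; the row and column labels are shorthand chosen for this sketch:

```python
import pandas as pd

# Counts copied from the Reviews>5930 crosstab (rows: Ratings>4.3 False/True)
counts = pd.DataFrame(
    {'low_reviews': [2906, 1778], 'high_reviews': [2396, 2287]},
    index=['rating_low', 'rating_high'],
)

# Column-wise normalization gives P(rating group | review group)
cond_prob = counts / counts.sum(axis=0)

# Difference in P(Rating = HIGH) between the two review groups
gap = (cond_prob.loc['rating_high', 'high_reviews']
       - cond_prob.loc['rating_high', 'low_reviews'])
print(cond_prob.round(6))
print(round(gap, 3))
```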

Let us increase the pivot for reviews to 80000 and check again. We do not need to check whether the group is too small, as 80000 is almost at the 75th percentile (81532.75).

In [48]:
pd.crosstab(df['Rating']>4.3,df['Reviews']>80000,rownames=['Ratings>4.3'],colnames=['Reviews>80000'],margins= True,normalize='columns')

Reviews>80000,False,True,All
Ratings>4.3,,,
False,0.613442,0.42518,0.56603
True,0.386558,0.57482,0.43397


Now the probabilities differ clearly (0.57 for Reviews > 80000 against 0.39 otherwise), and hence Reviews>80000 is a good feature.
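Trying thresholds one by one can be wrapped in a small helper. The function below is a hypothetical convenience, not part of the original notebook: for a given feature and threshold it reports the share of apps above the threshold and P(Rating = HIGH) on each side. The toy data and the threshold of 500 are made up for illustration.

```python
import pandas as pd

def split_summary(data, feature, threshold, target='Rating', good=4.3):
    """Summarise how well `feature > threshold` separates high ratings."""
    valid = data[[feature, target]].dropna()
    above = valid[feature] > threshold
    high = valid[target] > good
    p_above = high[above].mean()    # P(high | feature > threshold)
    p_below = high[~above].mean()   # P(high | feature <= threshold)
    return {
        'share_above': above.mean(),
        'p_high_above': p_above,
        'p_high_below': p_below,
        'gap': p_above - p_below,
    }

toy = pd.DataFrame({
    'Rating':  [4.5, 4.0, 4.7, 3.9, 4.4, 4.1],
    'Reviews': [100, 200, 900,  50, 800, 700],
})
summary = split_summary(toy, 'Reviews', 500)
print(summary)
```

On the real data one could call it for several quantiles, for example `split_summary(df, 'Reviews', df['Reviews'].quantile(0.75))`. Unlike the crosstab cells, this helper drops rows with missing values first, so its numbers may differ slightly from the tables.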

**ii) Price**

We will do the same for the 'Price' column to find the most discriminative split. In this case even the 75th percentile is 0, so instead we classify the apps as free or not free :

In [49]:
pd.crosstab(df['Rating']>4.3,df['Price']==0,rownames=['Ratings>4.3'],colnames=['Price=$0'],margins= True)

Price=$0,False,True,All
Ratings>4.3,,,
False,288,5014,5302
True,360,3705,4065
All,648,8719,9367


This shows that Price is difficult to use as a feature: only 648 of the 9367 apps are not free (about 7%), so the split is very unbalanced. Hence it is a doubtful feature. If we still want to force it as a feature, let us look at the conditional probability :

In [50]:
pd.crosstab(df['Rating']>4.3,df['Price']==0,rownames=['Ratings>4.3'],colnames=['Price=$0'],margins= True,normalize='columns')

Price=$0,False,True,All
Ratings>4.3,,,
False,0.444444,0.575066,0.56603
True,0.555556,0.424934,0.43397


The probabilities differ only moderately (0.56 for paid apps against 0.42 for free apps), and the paid group is small, hence this would serve as a bad feature in any case.
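Putting the three splits side by side, using only the probabilities reported in the tables above, gives a rough ranking. The gap is the absolute difference in P(Rating = HIGH) between the two groups of each split; the column names are shorthand chosen for this sketch:

```python
import pandas as pd

# P(Rating > 4.3) when the condition is True / False, copied from the tables
splits = pd.DataFrame(
    {
        'p_high_true':  [0.488362, 0.574820, 0.424934],
        'p_high_false': [0.379590, 0.386558, 0.555556],
    },
    index=['Reviews>5930', 'Reviews>80000', 'Price==0'],
)
splits['abs_gap'] = (splits['p_high_true'] - splits['p_high_false']).abs()
print(splits.round(3))
```

The Price gap is not the smallest of the three, but it rests on only 648 paid apps, which is why group size has to be considered alongside the gap.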

This is the end of this tutorial. Now you can move on to assignment 7, in which you have to check the other two features.