# Practicing Feature Selection and Feature Extraction

In this example we run a few experiments with feature selection and feature extraction methods.

***

## Dataset used:

### Mobile Price Classification
The data contains a range of features, and our job is to identify WHICH ones to use and HOW to use them.

**Source:** https://www.kaggle.com/iabhishekofficial/mobile-price-classification#train.csv



In [25]:
import numpy as np
import pandas as pd
import IPython as ipy
import matplotlib.pyplot as plt

In [3]:
dataSet = pd.read_csv('train.csv')
dataSet.head(3)

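|   | battery_power | blue | clock_speed | dual_sim | fc | four_g | int_memory | m_dep | mobile_wt | n_cores | ... | px_height | px_width | ram | sc_h | sc_w | talk_time | three_g | touch_screen | wifi | price_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842 | 0 | 2.2 | 0 | 1 | 0 | 7 | 0.6 | 188 | 2 | ... | 20 | 756 | 2549 | 9 | 7 | 19 | 0 | 0 | 1 | 1 |
| 1 | 1021 | 1 | 0.5 | 1 | 0 | 1 | 53 | 0.7 | 136 | 3 | ... | 905 | 1988 | 2631 | 17 | 3 | 7 | 1 | 1 | 0 | 2 |
| 2 | 563 | 1 | 0.5 | 1 | 2 | 1 | 41 | 0.9 | 145 | 5 | ... | 1263 | 1716 | 2603 | 11 | 2 | 9 | 1 | 1 | 0 | 2 |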


In [6]:
dataSet.columns

Index(['battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc', 'four_g',
       'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'px_height',
       'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'three_g',
       'touch_screen', 'wifi', 'price_range'],
      dtype='object')

### Important functions to keep in mind
* **pd.isnull(obj)** - *returns: a boolean object with the same shape as `obj`, holding `True` where a value is missing (NaN/None) and `False` where it is present*

In [None]:
dataSet['sc_h']

In [31]:
# Returns a boolean Series of the same length: True where the value is missing, False otherwise
pd.isnull(dataSet['battery_power'])

# Counts how many True and how many False values that boolean Series contains
pd.isnull(dataSet['battery_power']).value_counts()

False    2000
Name: battery_power, dtype: int64
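
The `battery_power` column has no missing values, so the check above only ever reports `False`. As a minimal sketch of how `isnull` behaves when gaps do exist, the snippet below uses a small made-up frame (the name `toy` and all of its values are illustrative assumptions, not rows from `train.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with deliberate gaps (not taken from train.csv)
toy = pd.DataFrame({
    'battery_power': [842, np.nan, 563],
    'ram': [2549, 2631, np.nan],
    'wifi': [1, 0, 0],
})

# Boolean DataFrame of the same shape: True where a value is missing
mask = pd.isnull(toy)

# Number of missing values in each column
missing_per_column = mask.sum()

# Rows that contain at least one missing value
rows_with_gaps = toy[mask.any(axis=1)]
```

Summing the mask per column is a quick way to decide which features need imputation or removal before any selection step.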

In [38]:
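# groupby().filter() keeps or drops whole groups, so its function must return a
# single bool per group. The original call,
#   dataSet.groupby(['battery_power']).filter(lambda x: x > 1200)
# raised "TypeError: filter function returned a DataFrame, but expected a scalar bool".
# To keep the rows with battery_power above 1200, use a boolean mask:
dataSet[dataSet['battery_power'] > 1200]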
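
The error comes from the difference between selecting rows and selecting groups. The sketch below contrasts the two on a small made-up frame (the name `phones` and its values are illustrative assumptions, not rows from `train.csv`): a boolean mask decides row by row, while `groupby(...).filter` expects one boolean per group.

```python
import pandas as pd

# Hypothetical toy data (values are illustrative, not from train.csv)
phones = pd.DataFrame({
    'n_cores': [2, 2, 4, 4, 8],
    'battery_power': [800, 1300, 1250, 1400, 1000],
})

# Row-wise selection: the mask keeps each individual row above 1200
high_power_rows = phones[phones['battery_power'] > 1200]

# Group-wise selection: the function returns ONE bool per group, so whole
# groups are kept or dropped. A group stays if its mean exceeds 1200.
high_power_groups = phones.groupby('n_cores').filter(
    lambda g: g['battery_power'].mean() > 1200
)
```

Here the mask keeps three rows, while the group filter keeps only the two rows of the `n_cores == 4` group, because that is the only group whose mean is above the threshold.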

In [42]:
data_url = 'http://vincentarelbundock.github.io/Rdatasets/csv/carData/Salaries.csv'
df = pd.read_csv(data_url, index_col=0)
 
df.head()

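|   | rank | discipline | yrs.since.phd | yrs.service | sex | salary |
|---|---|---|---|---|---|---|
| 1 | Prof | B | 19 | 18 | Male | 139750 |
| 2 | Prof | B | 20 | 16 | Male | 173200 |
| 3 | AsstProf | B | 4 | 3 | Male | 79750 |
| 4 | Prof | B | 45 | 39 | Male | 115000 |
| 5 | Prof | B | 40 | 41 | Male | 141500 |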


In [48]:
# Grouping by one factor
df_rank = df.groupby('rank')

In [46]:
df_rank.count()

| rank | discipline | yrs.since.phd | yrs.service | sex | salary |
|---|---|---|---|---|---|
| AssocProf | 64 | 64 | 64 | 64 | 64 |
| AsstProf | 67 | 67 | 67 | 67 | 67 |
| Prof | 266 | 266 | 266 | 266 | 266 |


In [50]:
df.groupby('rank').describe()

| rank | yrs.since.phd count | yrs.since.phd mean | yrs.since.phd std | yrs.since.phd min | yrs.since.phd 25% | yrs.since.phd 50% | yrs.since.phd 75% | yrs.since.phd max | yrs.service count | yrs.service mean | ... | yrs.service 75% | yrs.service max | salary count | salary mean | salary std | salary min | salary 25% | salary 50% | salary 75% | salary max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AssocProf | 64.0 | 15.453125 | 9.652584 | 6.0 | 10.0 | 12.0 | 17.25 | 49.0 | 64.0 | 11.953125 | ... | 11.0 | 53.0 | 64.0 | 93876.4375 | 13831.699844 | 62884.0 | 82475.0 | 95626.5 | 104226.25 | 126431.0 |
| AsstProf | 67.0 | 5.104478 | 2.541381 | 1.0 | 3.5 | 4.0 | 7.0 | 11.0 | 67.0 | 2.373134 | ... | 3.0 | 6.0 | 67.0 | 80775.985075 | 8174.112637 | 63100.0 | 74000.0 | 79800.0 | 88597.5 | 97032.0 |
| Prof | 266.0 | 28.300752 | 10.10883 | 11.0 | 20.0 | 28.0 | 36.75 | 56.0 | 266.0 | 22.815789 | ... | 30.0 | 60.0 | 266.0 | 126772.109023 | 27718.674999 | 57800.0 | 105975.25 | 123321.5 | 145080.5 | 231545.0 |
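
`describe()` on a grouped frame produces every statistic for every numeric column, which gets wide quickly. When only a few statistics of one column matter, `agg` with a list of function names may be easier to read. A minimal sketch, using a small made-up frame that mirrors two of the Salaries columns (the name `toy` and its values are illustrative assumptions, not the real data):

```python
import pandas as pd

# Hypothetical toy data mirroring the 'rank' and 'salary' columns (values are made up)
toy = pd.DataFrame({
    'rank': ['Prof', 'Prof', 'AsstProf', 'AsstProf'],
    'salary': [100, 140, 70, 90],
})

# One row per rank, one column per requested statistic
salary_stats = toy.groupby('rank')['salary'].agg(['count', 'mean', 'max'])
```

The result is indexed by `rank`, so a single value can be read with `salary_stats.loc['Prof', 'mean']`.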


In [51]:
s = pd.Series([1, 2, 3, 4])
s.describe()

count    4.000000
mean     2.500000
std      1.290994
min      1.000000
25%      1.750000
50%      2.500000
75%      3.250000
max      4.000000
dtype: float64

In [52]:
s = pd.Series([3, 2, 1, 4])
s.describe()

count    4.000000
mean     2.500000
std      1.290994
min      1.000000
25%      1.750000
50%      2.500000
75%      3.250000
max      4.000000
dtype: float64

In [92]:
# Create a Series whose two elements are lists; as a DataFrame it becomes a single column (0) holding one list per row
ba = pd.Series([[1,2,3,4], [4,3,2,1]])
ba_df = pd.DataFrame(ba)

col2 = {'col2': [2,4]}
ba_df = ba_df.join(pd.DataFrame(col2))
ba_df

pandas.core.series.Series

In [66]:
dat1 = pd.DataFrame({'dat1': [9,5]})
dat2 = pd.DataFrame({'dat2': [7,6]})
dat1.join(dat2)

|   | dat1 | dat2 |
|---|---|---|
| 0 | 9 | 7 |
| 1 | 5 | 6 |
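
The two frames above share the same default index (0 and 1), so `join` lines them up cleanly. `join` matches rows by index label, not by position, which matters once the indexes differ. A minimal sketch with made-up, deliberately offset indexes (the names `left` and `right` are illustrative):

```python
import pandas as pd

# Same values as above, but with indexes that only partly overlap (illustrative)
left = pd.DataFrame({'dat1': [9, 5]}, index=[0, 1])
right = pd.DataFrame({'dat2': [7, 6]}, index=[1, 2])

# Default how='left': keep the left index, unmatched rows get NaN in dat2
left_joined = left.join(right)

# how='outer': keep the union of both indexes
outer_joined = left.join(right, how='outer')
```

With the default left join, row 0 has no partner in `right` and receives `NaN`, while the outer join also keeps row 2 from `right`.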


In [111]:
columnWise = pd.DataFrame({'c1': [1,2,3,4], 'c2': [4,3,2,1]})
columnWise


|   | c1 | c2 |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 3 |
| 2 | 3 | 2 |
| 3 | 4 | 1 |


In [122]:
# axis=0 (alias 'rows'): one value per row, so 1 is added to the 1st row,
# 2 to the 2nd, 3 to the 3rd and 4 to the 4th, in both c1 and c2
columnWise.add([1,2,3,4], axis=0)
columnWise.add([1,2,3,4], axis='rows')

# axis=1 (alias 'columns'): one value per column, so 1 is added to every
# value of c1 and 2 to every value of c2. add() returns a new DataFrame.
columnWise.add([1,2], axis=1)
columnWise.add([1,2], axis='columns');
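
To make the effect of `axis` concrete, the sketch below rebuilds the same frame and stores both results so they can be inspected (the names `by_row` and `by_col` are illustrative):

```python
import pandas as pd

columnWise = pd.DataFrame({'c1': [1, 2, 3, 4], 'c2': [4, 3, 2, 1]})

# axis=0: the list is matched against the row index (one value per row)
by_row = columnWise.add([1, 2, 3, 4], axis=0)

# axis=1: the list is matched against the columns (one value per column)
by_col = columnWise.add([1, 2], axis=1)

print(by_row)  # c1 becomes 2, 4, 6, 8 and c2 becomes 5, 5, 5, 5
print(by_col)  # c1 becomes 2, 3, 4, 5 and c2 becomes 6, 5, 4, 3
```

Neither call modifies `columnWise`; assign the result back if the change should persist.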