# Practicing Feature Selection and Feature Extraction

In this example we run a few tests on Feature Selection and Feature Extraction methodologies.

***

## Dataset used:

### Mobile Price Classification
The data contains a variety of features, and our job is to identify WHICH ones to use and HOW to use them.

**Source:** https://www.kaggle.com/iabhishekofficial/mobile-price-classification#train.csv



In [2]:
import numpy as np
import pandas as pd
import IPython as ipy
import matplotlib.pyplot as plt

In [3]:
dataSet = pd.read_csv('train.csv')
dataSet.head(3)

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
0,842,0,2.2,0,1,0,7,0.6,188,2,...,20,756,2549,9,7,19,0,0,1,1
1,1021,1,0.5,1,0,1,53,0.7,136,3,...,905,1988,2631,17,3,7,1,1,0,2
2,563,1,0.5,1,2,1,41,0.9,145,5,...,1263,1716,2603,11,2,9,1,1,0,2


In [6]:
dataSet.columns

Index(['battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc', 'four_g',
       'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'px_height',
       'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'three_g',
       'touch_screen', 'wifi', 'price_range'],
      dtype='object')

In [10]:
dataSet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 21 columns):
battery_power    2000 non-null int64
blue             2000 non-null int64
clock_speed      2000 non-null float64
dual_sim         2000 non-null int64
fc               2000 non-null int64
four_g           2000 non-null int64
int_memory       2000 non-null int64
m_dep            2000 non-null float64
mobile_wt        2000 non-null int64
n_cores          2000 non-null int64
pc               2000 non-null int64
px_height        2000 non-null int64
px_width         2000 non-null int64
ram              2000 non-null int64
sc_h             2000 non-null int64
sc_w             2000 non-null int64
talk_time        2000 non-null int64
three_g          2000 non-null int64
touch_screen     2000 non-null int64
wifi             2000 non-null int64
price_range      2000 non-null int64
dtypes: float64(2), int64(19)
memory usage: 328.2 KB


# Globally importing second dataset through URL
The second dataset is imported directly from a URL. It provides a sample set for testing features such as GROUPBY and aggregations.

In [41]:
data_url = 'http://vincentarelbundock.github.io/Rdatasets/csv/carData/Salaries.csv'
urlData = pd.read_csv(data_url, index_col=0)

# Creating DataFrame from two lists
We can create a dataframe from multiple lists by passing them as the values of a dictionary keyed by column name. The syntax is given below.

In [4]:
columnWise = pd.DataFrame({'c1': [1,2,3,4], 'c2': [4,3,2,1]})
columnWise


Unnamed: 0,c1,c2
0,1,4
1,2,3
2,3,2
3,4,1


# Merging a new/defined Column to existing DataFrame
Two functions are available to do so:
1. **JOIN**
2. **CONCAT**

In [125]:
# Create a Series in which each element is a LIST; as a DataFrame it becomes a single column named 0
ba = pd.Series([[1,2,3,4], [4,3,2,1]])
ba_df = pd.DataFrame(ba)

# new column for merger
col2 = {'col2': [2,4]}

# use Pandas function JOIN or CONCAT to merge column to dataframe
ba_df = ba_df.join(pd.DataFrame(col2))
ba_df

Unnamed: 0,0,col2
0,"[1, 2, 3, 4]",2
1,"[4, 3, 2, 1]",4
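The cell above uses JOIN. As a minimal sketch of the CONCAT alternative, the same merge could look like the following; the names `left`, `right` and `merged` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same starting frame as the cell above: one column (named 0) holding lists
left = pd.DataFrame(pd.Series([[1, 2, 3, 4], [4, 3, 2, 1]]))

# New column to attach
right = pd.DataFrame({'col2': [2, 4]})

# axis=1 places the frames side by side, aligning rows on the index
merged = pd.concat([left, right], axis=1)
print(merged)
```

Both approaches align rows on the index, so they give the same result here because the two frames share the index 0..1.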


# Important functions to keep in mind
* **ADD, SUB, MUL, DIV Operations** - Demonstrated in multiple notations available
* **pd.isnull(DataFrame)** - ***returns**: an object of the same dimensions with 'True' where NULL and 'False' where Not-NULL*
* **df.describe()** - ***returns**: basic aggregations of the data such as min, max, std, count, percentiles, etc.*
* **df.info()** - This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.
* **df.rename() & df.index.rename** - Used to rename column names of a dataframe in several formats. _index.rename_ renames the INDEX column of dataframe.

***
## ADD, SUB, MUL, DIV Operations
With similar syntax you can perform all four operations.
**NOTE:** Sign-based multiplication (the `*` operator) only works for _Scalar_ and _Column-wise_ execution. _Row-wise_ operation is not supported with the sign; use the method form with axis=0 instead.

In [5]:
# Adds 1 to 1st row, 2 to 2nd row, 3 to 3rd, and 4 to 4th row of all values of c1 and c2
columnWise.add([1,2,3,4], axis=0)
columnWise.add([1,2,3,4], axis='rows')

# Adds 1 to all values of c1 and 2 to all values of c2
columnWise.add([1,2], axis=1)
columnWise.add([1,2], axis='columns');


# Multiply all values
columnWise * 5
columnWise.mul(5)

# Multiply each c1 value by 2 and c2 value by 3
columnWise * [2, 3]
columnWise.mul([2,3], axis=1)
columnWise.mul([2,3], axis='columns')

# Multiply row 1 by 5, row 2 by 6, row 3 by 7 and row 4 by 8 for both c1 & c2 columns
columnWise.mul([5, 6, 7, 8], axis=0)
columnWise.mul([5, 6, 7, 8], axis='rows');
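The cell above covers ADD and MUL. As a sketch of the remaining two operations under the same syntax, SUB and DIV could be written as follows; the result variable names are illustrative only.

```python
import pandas as pd

columnWise = pd.DataFrame({'c1': [1, 2, 3, 4], 'c2': [4, 3, 2, 1]})

# Subtract 1 from every value (scalar), in both notations
minus_one = columnWise - 1
same_minus_one = columnWise.sub(1)

# Divide each c1 value by 2 and each c2 value by 4 (column-wise)
col_div = columnWise.div([2, 4], axis='columns')

# Divide row 1 by 1, row 2 by 2, row 3 by 3 and row 4 by 4 (row-wise, method form only)
row_div = columnWise.div([1, 2, 3, 4], axis=0)
print(row_div)
```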

## pd.isnull(DataFrame)

In [None]:
# Returns a boolean Series of equal length: True where the value is NULL, False otherwise
pd.isnull(dataSet['battery_power'])

# Returns how many values are True (NULL) and how many are False (not NULL)
pd.isnull(dataSet['battery_power']).value_counts()
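The mobile price dataset has no missing values, so the calls above only ever return False. The following sketch uses a small hypothetical frame (`demo`, invented for illustration) to show what pd.isnull reports when NULLs are present.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column (not the real dataset)
demo = pd.DataFrame({'ram': [2549, np.nan, 2603], 'wifi': [1, 0, np.nan]})

# Boolean DataFrame with the same shape as demo
mask = pd.isnull(demo)

# True counts as 1, so summing the mask counts NULLs per column
missing_per_column = mask.sum()
print(missing_per_column)
```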

## df.describe()

In [150]:
# Testing DESCRIBE function on custom build series for analysis
# DESCRIBE function provides you with basic aggregations of data like count, mean, std, min, etc
s = pd.Series([1, 2, 3, 4])
s.describe();
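The trailing semicolon in the cell above suppresses the notebook output. As a small sketch, the result can be kept in a variable (here called `summary`, an illustrative name) and individual statistics looked up by label.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# describe() returns a Series indexed by the name of each statistic
summary = s.describe()
print(summary)

# Individual statistics can be read by label
mean_value = summary['mean']
median_value = summary['50%']
```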

## df.info()
This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.

In [12]:
urlData.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 397 entries, 1 to 397
Data columns (total 6 columns):
rank             397 non-null object
discipline       397 non-null object
yrs.since.phd    397 non-null int64
yrs.service      397 non-null int64
sex              397 non-null object
salary           397 non-null int64
dtypes: int64(3), object(3)
memory usage: 17.1+ KB


## df.rename() & df.index.rename
The **df.rename** function renames the columns of a dataframe, whereas **df.index.rename** renames the index column of your dataframe.
**NOTE:** Be sure to assign the result of _df.rename_ back to the original dataframe to register the change.

In [43]:
# Rename the INDEX column of dataframe
urlData.index.name = 'NewLife'

# Rename the RANK and SEX columns to new names accordingly
urlData = urlData.rename(index=str, columns={"rank": "seniority", "sex": "gender"})

# Preview the changed column names
urlData.head(3)

NewLife,seniority,discipline,yrs.since.phd,yrs.service,gender,salary
1,Prof,B,19,18,Male,139750
2,Prof,B,20,16,Male,173200
3,AsstProf,B,4,3,Male,79750
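The cell above sets the index name through the `index.name` attribute. The sketch below shows the `index.rename` method named in the heading, on a small hypothetical frame (`demo`) standing in for urlData.

```python
import pandas as pd

# Hypothetical three-row frame standing in for urlData
demo = pd.DataFrame({'rank': ['Prof', 'Prof', 'AsstProf'],
                     'sex': ['Male', 'Male', 'Male']},
                    index=[1, 2, 3])

# Index.rename returns a new Index, so assign it back to take effect
demo.index = demo.index.rename('NewLife')

# DataFrame.rename also returns a new object; demo itself keeps its old column names
renamed = demo.rename(columns={'rank': 'seniority', 'sex': 'gender'})
print(renamed)
```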


## df.sort_values('column_name')
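This section has no cell in the original notebook. A minimal sketch of sort_values, using a hypothetical three-row frame (`demo`) rather than the real data, might look like this.

```python
import pandas as pd

# Hypothetical frame; values chosen for illustration
demo = pd.DataFrame({'seniority': ['Prof', 'AsstProf', 'AssocProf'],
                     'salary': [139750, 79750, 97000]})

# Ascending order is the default
by_salary = demo.sort_values('salary')

# Pass ascending=False for descending order
by_salary_desc = demo.sort_values('salary', ascending=False)
print(by_salary_desc)
```

sort_values returns a new dataframe and keeps the original index labels, so the row order changes while the labels travel with their rows.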

## df.transpose() or df.T
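This section also has no cell in the original notebook. As a short sketch, transposing the columnWise frame defined earlier swaps its rows and columns; `flipped` is an illustrative name.

```python
import pandas as pd

columnWise = pd.DataFrame({'c1': [1, 2, 3, 4], 'c2': [4, 3, 2, 1]})

# .T is shorthand for .transpose(); column names become the index
flipped = columnWise.T
print(flipped)
```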

## df.aggregations
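No cell accompanies this heading in the original notebook. Assuming it refers to aggregation methods such as DataFrame.agg (there is no dataframe attribute literally named aggregations), a minimal sketch could be the following.

```python
import pandas as pd

columnWise = pd.DataFrame({'c1': [1, 2, 3, 4], 'c2': [4, 3, 2, 1]})

# Several aggregations applied to every column; result has one row per aggregation
summary = columnWise.agg(['min', 'max', 'mean'])

# A different aggregation per column; result is a Series indexed by column name
per_column = columnWise.agg({'c1': 'sum', 'c2': 'mean'})
print(summary)
print(per_column)
```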

# Free-form CODE
Free-form code that exercises some of the functionality discussed above.

In [51]:
urlData.index.name = 'NewLife'
urlData = urlData.rename(index=str, columns={"rank": "seniority", "sex": "gender"})
# print(urlData.index.name)
testUrlData = urlData.head(10)
testUrlData
# urlData['seniority'].value_counts()
# testUrlData.groupby(['seniority', 'discipline'])['yrs.service'].mean()



NewLife,seniority,discipline,yrs.since.phd,yrs.service,gender,salary
1,Prof,B,19,18,Male,139750
2,Prof,B,20,16,Male,173200
3,AsstProf,B,4,3,Male,79750
4,Prof,B,45,39,Male,115000
5,Prof,B,40,41,Male,141500
6,AssocProf,B,6,6,Male,97000
7,Prof,B,30,23,Male,175000
8,Prof,B,45,45,Male,147765
9,Prof,B,21,20,Male,119250
10,Prof,B,18,18,Female,129000


In [8]:
# Grouping by one factor
rankGroup = urlData.groupby('rank')

# urlData.groupby('rank', axis=1)
urlData.groupby('rank')['rank'].count()

rank
AssocProf     64
AsstProf      67
Prof         266
Name: rank, dtype: int64
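The cell above groups by a single factor. As a sketch of grouping by two factors, the keys go in a list and the aggregation is `mean()`; the frame `demo` below is hypothetical and only mimics the column names of the Salaries data.

```python
import pandas as pd

# Hypothetical rows using the Salaries column names
demo = pd.DataFrame({
    'rank': ['Prof', 'Prof', 'AsstProf', 'Prof'],
    'discipline': ['B', 'B', 'B', 'A'],
    'yrs.service': [18, 16, 3, 20],
})

# Multiple grouping keys are passed as a list; the result has a two-level index
mean_service = demo.groupby(['rank', 'discipline'])['yrs.service'].mean()
print(mean_service)
```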

In [180]:
lst = [1, 2, 3, 1, 2, 3]

s = pd.Series([1, 2, 3, 10, 20, 30], lst)
s
# s.groupby(level=0).first()

1     1
2     2
3     3
1    10
2    20
3    30
dtype: int64
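The commented-out line in the cell above groups on the index instead of a column. A minimal sketch of what that does with the repeated index labels, with `firsts` and `totals` as illustrative names:

```python
import pandas as pd

lst = [1, 2, 3, 1, 2, 3]
s = pd.Series([1, 2, 3, 10, 20, 30], index=lst)

# level=0 groups rows that share the same index label
firsts = s.groupby(level=0).first()   # first value seen for each label
totals = s.groupby(level=0).sum()     # sum of the values for each label
print(firsts)
print(totals)
```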

***

# KWDS Concept needs consideration

In [151]:
matrixData = pd.DataFrame(data={'A': [1, 0, 3, 4, 5, 6, 7, 8, 0, 10],
                        'B': [10, 0, 13, 10, 0, 8, 12, 13, 15, 0],
                        'C': [2, 10, 0, 0, 10, 8, 12, 13, 0, 0],
                        'D': [3, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                        'E': [0, 3, 5, 10, 0, 8, 12, 13, 15, 0],
                        'F': [9, 5, 0, 10, 0, 8, 0, 13, 15, 0]})
kwds = {'e1': 'A', 'm1': 'B', 'e2': 'C', 'm2': 'D', 'e3': 'E', 'm3': 'F', 'e4': 'D', 'm4': 'A'}

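def calcMoe(row, e1=None, m1=None, e2=None, m2=None, e3=None, m3=None, e4=None, m4=None):
    # row is a single row of the DataFrame (apply with axis=1); each e*/m* is a column name
    x = 0  # largest m value among the pairs whose e value is 0
    y = 0  # sum of squared m values among the pairs whose e value is non-zero
    if e1 is not None:
        if row[e1] == 0:
            x = max(x, row[m1])
        else:
            y = y + row[m1] ** 2
    if e2 is not None:
        if row[e2] == 0:
            x = max(x, row[m2])
        else:
            y = y + row[m2] ** 2
    if e3 is not None:
        if row[e3] == 0:
            x = max(x, row[m3])
        else:
            y = y + row[m3] ** 2
    if e4 is not None:
        if row[e4] == 0:
            x = max(x, row[m4])
        else:
            y = y + row[m4] ** 2
    return x ** 2 + y

# **kwds unpacks the dictionary into the keyword arguments e1, m1, ..., e4, m4
matrixData['G'] = matrixData.apply(calcMoe, axis=1, **kwds)
matrixData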

Unnamed: 0,A,B,C,D,E,F,G
0,1,10,2,3,0,9,191
1,0,0,10,2,3,5,29
2,3,13,0,3,5,0,187
3,4,10,0,4,10,10,232
4,5,0,10,5,0,0,50
5,6,8,8,6,8,8,200
6,7,12,12,7,12,0,242
7,8,13,13,8,13,13,466
8,0,15,0,9,15,15,450
9,10,0,0,10,0,0,200
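The four repeated branches in calcMoe all apply the same rule to a different pair of columns. As one possible alternative, sketched here under the assumption that the (e, m) pairs can be passed as a list, a loop gives the same values of G; `calc_moe_pairs` and `pairs` are hypothetical names, not part of the original notebook.

```python
import pandas as pd

matrixData = pd.DataFrame(data={'A': [1, 0, 3, 4, 5, 6, 7, 8, 0, 10],
                                'B': [10, 0, 13, 10, 0, 8, 12, 13, 15, 0],
                                'C': [2, 10, 0, 0, 10, 8, 12, 13, 0, 0],
                                'D': [3, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                                'E': [0, 3, 5, 10, 0, 8, 12, 13, 15, 0],
                                'F': [9, 5, 0, 10, 0, 8, 0, 13, 15, 0]})

def calc_moe_pairs(row, pairs):
    """Apply the calcMoe rule to a list of (e_column, m_column) pairs."""
    x = 0  # largest m value among pairs whose e value is 0
    y = 0  # sum of squared m values among pairs whose e value is non-zero
    for e_col, m_col in pairs:
        if row[e_col] == 0:
            x = max(x, row[m_col])
        else:
            y += row[m_col] ** 2
    return x ** 2 + y

# Same column pairing as the kwds dictionary: e1/m1, e2/m2, e3/m3, e4/m4
pairs = [('A', 'B'), ('C', 'D'), ('E', 'F'), ('D', 'A')]

# Extra keyword arguments given to apply are forwarded to the function
result = matrixData.apply(calc_moe_pairs, axis=1, pairs=pairs)
print(result.tolist())
```

Passing a list avoids fixing the number of pairs in the function signature, at the cost of no longer using the `**kwds` dictionary directly.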
