### Dimensionality Reduction Techniques 
### Dimensionality reduction means reducing the number of variables in a dataset. For example, if your dataset has 1000 variables, you need to identify which ones are important and then build a simpler model with fewer variables. The model will also be faster.

#### Building a model on all the variables makes it very complex.
#### Such a model is also hard to take to production, because implementing it with that many variables is not easy.
#### PySpark: you may start with 3000 variables, but the final model is typically built on around 25-30 of them.
#### Implementation is tedious because another team usually implements the model, so you still have to ensure that
#### the production model gives the same results. Dropping variables is therefore one approach to feature selection.
#### Now assume someone says not to drop any variables because all of them are important. What is the next step?
#### This is where dimensionality reduction techniques come into the picture.
#### Principal Component Analysis (PCA): Unsupervised Learning
#### Linear Discriminant Analysis (LDA): Supervised Learning


##### Principal Component Analysis (PCA): 
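##### Eigenvalues and eigenvectors (of the covariance matrix)
##### An eigenvector gives the direction of a principal component
##### An eigenvalue gives the amount of variance along that direction
##### Covariance: PCA works on the covariance matrix of the features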
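
The link between covariance, eigenvectors, and eigenvalues can be sketched with NumPy alone. The data below is synthetic and the variable names are illustrative; this is a minimal sketch of the idea, not the exact routine scikit-learn runs internally.

```python
import numpy as np

# Synthetic 2-D data with strongly correlated columns (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=500)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=500)])

# Center the data, then compute the covariance matrix of the columns
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Share of the total variance captured by each principal component
explained_ratio = eigenvalues / eigenvalues.sum()

# Project onto the first eigenvector to get the first principal component
pc1 = centered @ eigenvectors[:, 0]
print(explained_ratio)
```

Because the two columns are almost perfectly correlated, nearly all the variance falls along the first eigenvector.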


##### Linear Discriminant Analysis (LDA): applicable only to classification datasets.
##### LDA requires a target variable, because it looks for the directions that best separate the classes.
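
A minimal sketch of the difference from PCA, assuming synthetic two-class data (`X_demo` and `y_demo` are made-up names, not the notebook's datasets). LDA needs the labels when fitting, and it can produce at most `n_classes - 1` components.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic two-class data: 4 features, class 1 shifted by +2 on every feature
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(2, 1, (100, 4))])
y_demo = np.array([0] * 100 + [1] * 100)

# With 2 classes, LDA can produce at most n_classes - 1 = 1 component
lda_demo = LinearDiscriminantAnalysis(n_components=1)
projected = lda_demo.fit_transform(X_demo, y_demo)  # y is required, unlike PCA

print(projected.shape)
```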

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [61]:
df=pd.read_csv('voice-classification.csv')

In [62]:
df.head()

Unnamed: 0,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,...,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx,label
0,0.059781,0.064241,0.032027,0.015071,0.090193,0.075122,12.863462,274.402906,0.893369,0.491918,...,0.059781,0.084279,0.015702,0.275862,0.007812,0.007812,0.007812,0.0,0.0,male
1,0.066009,0.06731,0.040229,0.019414,0.092666,0.073252,22.423285,634.613855,0.892193,0.513724,...,0.066009,0.107937,0.015826,0.25,0.009014,0.007812,0.054688,0.046875,0.052632,male
2,0.077316,0.083829,0.036718,0.008701,0.131908,0.123207,30.757155,1024.927705,0.846389,0.478905,...,0.077316,0.098706,0.015656,0.271186,0.00799,0.007812,0.015625,0.007812,0.046512,male
3,0.151228,0.072111,0.158011,0.096582,0.207955,0.111374,1.232831,4.177296,0.963322,0.727232,...,0.151228,0.088965,0.017798,0.25,0.201497,0.007812,0.5625,0.554688,0.247119,male
4,0.13512,0.079146,0.124656,0.07872,0.206045,0.127325,1.101174,4.333713,0.971955,0.783568,...,0.13512,0.106398,0.016931,0.266667,0.712812,0.007812,5.484375,5.476562,0.208274,male


In [63]:
X= df.drop('label', axis=1)
y = df['label']

In [64]:
from sklearn.decomposition import PCA

In [65]:
pca = PCA(n_components = 2)

In [66]:
train_data = pca.fit_transform(X)

In [67]:
train_data

array([[238.08129358,  -4.80224677],
       [598.39659272,  -1.02137403],
       [988.7626671 ,   3.02260542],
       ...,
       [-29.95674252,  -3.26518493],
       [-31.19087408,  -2.32061968],
       [-30.76269816,  -6.64483239]])

In [68]:
new_df = pd.DataFrame(train_data, columns=['PC1', 'PC2'])

In [69]:
new_df.head()

Unnamed: 0,PC1,PC2
0,238.081294,-4.802247
1,598.396593,-1.021374
2,988.762667,3.022605
3,-32.368181,-6.639116
4,-32.286811,0.339167


In [70]:
pca.explained_variance_ratio_*100

array([99.86814257,  0.12673728])
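
A first component that explains almost all the variance often means one feature is on a much larger scale than the others (here `kurt` reaches values above 1000 while most columns stay below 1). PCA is sensitive to scale, so features are commonly standardized first. The sketch below uses synthetic data as a stand-in, so its numbers say nothing about the voice dataset itself.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: three independent features, one on a much larger scale
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(scale=1.0, size=300),
    rng.normal(scale=1.0, size=300),
    rng.normal(scale=500.0, size=300),  # dominates the raw variance
])

# PCA on raw values: the large-scale column absorbs nearly all the variance
raw_ratio = PCA(n_components=2).fit(X_demo).explained_variance_ratio_

# PCA after standardizing each column to zero mean and unit variance
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2))
scaled_pca.fit(X_demo)
scaled_ratio = scaled_pca.named_steps['pca'].explained_variance_ratio_

print(raw_ratio)
print(scaled_ratio)
```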

In [72]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [76]:
lda = LinearDiscriminantAnalysis(n_components = 1)
data = lda.fit_transform(X, y)

In [77]:
new_df = pd.DataFrame(data, columns=['LD1'])

In [78]:
new_df.head()

Unnamed: 0,LD1
0,1.236091
1,-1.065892
2,0.876128
3,3.033699
4,2.32972


In [23]:
df2 = pd.read_csv('seattleWeather_1948-2017.csv')

In [27]:
df2.head()

Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN
0,1948-01-01,0.47,51,42,True
1,1948-01-02,0.59,45,36,True
2,1948-01-03,0.42,45,35,True
3,1948-01-04,0.31,45,34,True
4,1948-01-05,0.17,45,32,True


In [32]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25551 entries, 0 to 25550
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   DATE    25551 non-null  object 
 1   PRCP    25548 non-null  float64
 2   TMAX    25551 non-null  int64  
 3   TMIN    25551 non-null  int64  
 4   RAIN    25548 non-null  object 
dtypes: float64(1), int64(2), object(2)
memory usage: 998.2+ KB


In [35]:
from sklearn.preprocessing import LabelEncoder

In [42]:
le = LabelEncoder()
df2['DATE']=le.fit_transform(df2['DATE'])
df2.head(2)

Unnamed: 0,DATE,PRCP,TMAX,TMIN,RAIN
0,0,0.47,51,42,True
1,1,0.59,45,36,True
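
`LabelEncoder` turns each date string into an arbitrary integer code. An alternative, sketched here on a small made-up frame (`demo` and the derived column names are hypothetical), is to parse the dates and derive numeric calendar features.

```python
import pandas as pd

# Small stand-in frame using the DATE format shown above
demo = pd.DataFrame({'DATE': ['1948-01-01', '1948-01-02', '1948-02-15']})

# Parse the strings into datetimes, then derive numeric calendar features
parsed = pd.to_datetime(demo['DATE'])
demo['YEAR'] = parsed.dt.year
demo['MONTH'] = parsed.dt.month
demo['DAY_OF_YEAR'] = parsed.dt.dayofyear
print(demo)
```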


In [47]:
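# PRCP and RAIN each have 3 missing values (see df2.info()); scikit-learn estimators reject NaN
df2_clean = df2.dropna(subset=['PRCP', 'RAIN'])
X1 = df2_clean.drop('RAIN', axis=1)
y1 = df2_clean['RAIN'].astype(int)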

In [38]:
from sklearn.decomposition import PCA

In [39]:
pca = PCA(n_components = 2)

In [44]:
# Kept commented out; note that PCA raises a ValueError if X1 contains NaN
#train_data = pca.fit_transform(X1)

In [45]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [54]:
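lda = LinearDiscriminantAnalysis(n_components = 1)
# Note: X and y are the voice dataset defined above, not X1 and y1 (weather),
# so the output below repeats the earlier LDA result.
data = lda.fit_transform(X, y)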

In [55]:
new_df = pd.DataFrame(data, columns=['LD1'])

In [56]:
new_df.head()

Unnamed: 0,LD1
0,1.236091
1,-1.065892
2,0.876128
3,3.033699
4,2.32972


## Decision Tree 

##### Just as a tree has roots, a trunk, and branches,
##### a decision tree flowchart starts at a root node and then splits into branches,
##### and every branch ends in a leaf node.

###### Assume you have age, salary, and location as features, with loan approval as the target variable.
###### The first node is the root node.
###### The Gini index measures how impure a split is. It is 0 for a pure node and always stays below 1 (at most 0.5 for two classes).
###### A value closer to 0 indicates a good split. The higher the Gini index, the higher the impurity, and the worse the split.
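
The Gini calculation can be sketched in plain Python. The helper names `gini_impurity` and `split_gini` and the loan-approval labels are hypothetical, chosen only to illustrate the formula 1 - sum(p_k^2).

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Weighted Gini impurity of a candidate split into two child nodes."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

# Hypothetical loan-approval labels (1 = approved, 0 = rejected)
pure_split = split_gini([1, 1, 1, 1], [0, 0, 0, 0])   # each child holds one class
mixed_split = split_gini([1, 0, 1, 0], [1, 0, 1, 0])  # each child is a 50/50 mix
print(pure_split, mixed_split)
```

The pure split scores 0.0 and the evenly mixed split scores 0.5, the worst case for two classes, so a tree would prefer the first.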

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [10]:
df = pd.read_csv('pima-indians-diabetes.csv')
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [23]:
from sklearn.model_selection import train_test_split
X = df.drop('Outcome', axis=1)
y= df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=100)

In [24]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

DecisionTreeClassifier()

In [28]:
pred_dt = dt.predict(X_test)

In [39]:
from sklearn.metrics import roc_auc_score, roc_curve, auc

In [45]:
from sklearn.metrics import accuracy_score, classification_report
accuracy_score(y_test, pred_dt)

0.6379310344827587

In [41]:
accuracy_score(y_train, dt.predict(X_train))

1.0

In [46]:
print(classification_report(y_test, pred_dt))

              precision    recall  f1-score   support

           0       0.72      0.72      0.72        75
           1       0.49      0.49      0.49        41

    accuracy                           0.64       116
   macro avg       0.60      0.60      0.60       116
weighted avg       0.64      0.64      0.64       116
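
A training accuracy of 1.0 next to a test accuracy of about 0.64 suggests the unrestricted tree is overfitting. One common remedy is to limit the tree depth. The sketch below compares the two settings on synthetic data (not the diabetes dataset), with `max_depth=3` chosen arbitrarily for illustration; it also reports AUC, which the notebook imports but does not use.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 20% label noise (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=800, n_features=8, n_informative=4,
                                     flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.15, random_state=100)

results = {}
for name, depth in [('unrestricted', None), ('max_depth=3', 3)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[name] = {
        'train_acc': accuracy_score(y_tr, tree.predict(X_tr)),
        'test_acc': accuracy_score(y_te, tree.predict(X_te)),
        # AUC uses the predicted probability of the positive class
        'test_auc': roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1]),
    }

for name, scores in results.items():
    print(name, scores)
```

A smaller gap between training and test accuracy for the depth-limited tree would indicate less overfitting; the right depth for real data is best chosen by cross-validation.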

