## Week 19 in-class assignment

### 1. Take one of the supervised learning models you have built recently and apply at least three dimensionality reduction techniques to it (separately). Be sure to create a short summary of each technique you use. Indicate how each changed the model performance. Reference: https://machinelearningmastery.com/dimensionality-reduction-algorithms-with-python/


In [2]:
import numpy as np
import pandas as pd
from sklearn import metrics
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

heart_df = pd.read_csv('df.csv') #bring in Clarine's preprocessed arrhythmia dataset

X = heart_df.drop('class', axis=1).values
y = heart_df['class'].values

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42, stratify=y)

# Standardize
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)  # reuse the scaler fitted on the training data; do not refit on the test set

#logistic regression
clr = LogisticRegression(random_state=42).fit(X_train_sc,y_train)

#predict
y_predicted = clr.predict(X_test_sc)

#accuracy
print('Accuracy score from logistic regression:', clr.score(X_test_sc,y_test).round(2))

Accuracy score from logistic regression: 0.76


#### The 3 techniques we chose were SVD, PCA and Isomap

### 1. SVD

SVD, short for Singular Value Decomposition, is typically used with sparse data (data that has many zero values). n_components sets the number of desired dimensions (columns) in the output of the transform. SVD reduces the number of features while preserving as much of the structure (relationships) of the original data as possible.
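
To make the idea concrete, here is a minimal NumPy sketch of a truncated SVD projection. The helper name `truncated_svd_reduce` and the toy matrix are made up for illustration; this is not the arrhythmia data, and it is not how `TruncatedSVD` is implemented internally, only the same basic idea.

```python
import numpy as np

def truncated_svd_reduce(X, n_components):
    """Project X onto its top right-singular vectors (no centering).

    Hypothetical helper for illustration only.
    """
    # Thin SVD: X = U @ diag(S) @ Vt, singular values sorted in descending order
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    components = Vt[:n_components]   # shape (n_components, n_features)
    X_reduced = X @ components.T     # shape (n_samples, n_components)
    return X_reduced, components

# Toy, mostly-zero matrix (assumed data, roughly 70% zeros)
rng = np.random.default_rng(0)
X_toy = rng.random((20, 8)) * (rng.random((20, 8)) < 0.3)

X_red, comps = truncated_svd_reduce(X_toy, n_components=3)
print(X_toy.shape, '->', X_red.shape)
```

The 8 original columns are replaced by 3 new columns, each a linear combination of the originals.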

Using the best value for n_components (54), the accuracy increased by 0.03 (from 0.76 to 0.79) over the baseline logistic regression model.

Loop over candidate values to find the best value for n_components.

In [3]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.decomposition import TruncatedSVD

score1 = 0
n1 = 0

for n in range(1,136):
    #build and fit the model
    svd = TruncatedSVD(n_components=n)
    
    X_train_svd = svd.fit_transform(X_train)
    X_test_svd = svd.transform(X_test)
    
    clr = LogisticRegression(random_state=42).fit(X_train_svd, y_train)
    
    score = clr.score(X_test_svd,y_test)
    
    if score1 < score:
        score1 = score
        n1 = n
        
print(n1,score1)
print('Accuracy score from SVD:', score1.round(2))

54 0.7941176470588235
Accuracy score from SVD: 0.79
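
The loop above picks n_components by scoring each candidate on the test set, so the reported accuracy may be somewhat optimistic. One possible alternative, sketched here on synthetic stand-in data (not the arrhythmia dataset), is to choose n_components by cross-validation on the training set only. The candidate grid and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: NOT the arrhythmia dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=30, n_informative=8, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Chain the reduction and the classifier so both are refit inside each CV fold
pipe = Pipeline([
    ('svd', TruncatedSVD(random_state=42)),
    ('clf', LogisticRegression(max_iter=1000, random_state=42)),
])

# Candidate values are arbitrary; each must not exceed the number of features
search = GridSearchCV(pipe, {'svd__n_components': [2, 5, 10, 20]}, cv=5)
search.fit(X_tr, y_tr)

best_n = search.best_params_['svd__n_components']
test_acc = search.score(X_te, y_te)   # test set is touched only once, at the end
print(best_n, round(test_acc, 2))
```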


#### Run SVD for the n_components number deemed to have the best results.

In [4]:
svd = TruncatedSVD(n_components=54)
    
X_train_svd = svd.fit_transform(X_train)
X_test_svd = svd.transform(X_test)
    
clr = LogisticRegression(random_state=42).fit(X_train_svd, y_train)
    
print('Accuracy score from SVD:', clr.score(X_test_svd,y_test).round(2))

Accuracy score from SVD: 0.79


### 2. PCA

PCA, short for Principal Component Analysis, is typically used with dense data (data that has few zero values). As with SVD, n_components sets the number of desired dimensions (columns) of the output. PCA performs linear dimensionality reduction, using SVD to project the data onto a lower-dimensional space. The input data is centered, but not scaled, for each feature before the SVD is applied.
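
The "center, then apply SVD" description can be sketched in a few lines of NumPy. The helper name `pca_via_svd` and the toy data are assumptions for illustration; scikit-learn's `PCA` offers more options than this sketch covers.

```python
import numpy as np

def pca_via_svd(X, n_components):
    """PCA sketch: center each feature, apply SVD, then project.

    Hypothetical helper for illustration only.
    """
    X_centered = X - X.mean(axis=0)          # centered, but not scaled
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]           # principal directions
    X_reduced = X_centered @ components.T    # coordinates in the reduced space
    # Fraction of the total variance captured by each kept component
    explained_ratio = (S[:n_components] ** 2) / np.sum(S ** 2)
    return X_reduced, explained_ratio

# Toy data whose columns have very different spreads (assumed data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 6)) * np.array([10.0, 5.0, 1.0, 1.0, 0.5, 0.1])

X_red, ratio = pca_via_svd(X_toy, n_components=2)
print(X_red.shape, ratio.round(2))
```

Because the data is not scaled, the columns with the largest spread dominate the first components, which is one reason standardizing before PCA is often worth trying.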

Using the best value for n_components (51), the accuracy increased by 0.06 (from 0.76 to 0.82) over the baseline logistic regression model.

Loop over candidate values to find the best value for n_components.

In [5]:
from sklearn.decomposition import PCA

score1 = 0
n1 = 0

for n in range(1,136):
    #build and fit the model
    pca = PCA(n_components=n)
    
    X_train_pca = pca.fit_transform(X_train)
    X_test_pca = pca.transform(X_test)
    
    classifier = LogisticRegression(random_state=42).fit(X_train_pca, y_train)
    
    score = classifier.score(X_test_pca,y_test)
    
    if score1 < score:
        score1 = score
        n1 = n

print(n1,score1)
print('Accuracy score from PCA:', score1.round(2))

51 0.8235294117647058
Accuracy score from PCA: 0.82


#### Run PCA for the n_components number deemed to have the best results.

In [6]:
pca = PCA(n_components=51)
    
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
    
classifier = LogisticRegression(random_state=42).fit(X_train_pca, y_train)
    
print('Accuracy score from PCA:', classifier.score(X_test_pca,y_test).round(2))

Accuracy score from PCA: 0.82


### 3. Isomap

Isomap stands for isometric mapping. It is a nonlinear dimensionality reduction method that tries to preserve geodesic distances (distances measured along the manifold) in the lower-dimensional space, instead of straight-line Euclidean distances. On a nonlinear manifold, Euclidean distance is a good measure only where the neighborhood structure can be approximated as linear. If the neighborhood contains holes, Euclidean distances can be misleading.
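
A small NumPy sketch shows how the two distances can disagree on a curved shape. It uses points on a half circle as an assumed toy manifold, and it only illustrates the distance idea; Isomap itself builds a nearest-neighbor graph and uses shortest paths through it.

```python
import numpy as np

# Points along a half circle of radius 1 (a simple curved 1-D manifold)
theta = np.linspace(0.0, np.pi, 200)
points = np.column_stack([np.cos(theta), np.sin(theta)])

# Straight-line (Euclidean) distance between the two end points
euclidean = np.linalg.norm(points[-1] - points[0])

# Approximate geodesic distance: add up the short hops between neighboring points
hops = np.linalg.norm(np.diff(points, axis=0), axis=1)
geodesic = hops.sum()

print(round(float(euclidean), 2), round(float(geodesic), 2))
```

The end points are 2 units apart in a straight line, but about pi units apart when the path has to follow the curve.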

Using the best value for n_components (23), the accuracy decreased by 0.02 (from 0.76 to 0.74) compared with the baseline logistic regression model.

Loop over candidate values to find the best value for n_components.

In [7]:
from sklearn.manifold import Isomap

score1 = 0
n1 = 0

for n in range(1,136):
    #build and fit the model
    iso = Isomap(n_components=n)
    X_train_iso = iso.fit_transform(X_train)
    X_test_iso = iso.transform(X_test)
    
    classifier = LogisticRegression(random_state=42).fit(X_train_iso, y_train)
    
    score = classifier.score(X_test_iso,y_test)
    
    if score1 < score:
        score1 = score
        n1 = n
        
print(n1,score1)
print('Accuracy score from Isomap:', score1.round(2))

23 0.7352941176470589
Accuracy score from Isomap: 0.74


#### Run Isomap for the n_components number deemed to have the best results.

In [16]:
iso = Isomap(n_components=23)

X_train_iso = iso.fit_transform(X_train)
X_test_iso = iso.transform(X_test)
    
iso_model = LogisticRegression(random_state=42).fit(X_train_iso, y_train)
    
print('Accuracy score from Isomap:', iso_model.score(X_test_iso,y_test).round(2))

Accuracy score from Isomap: 0.74


### 2. Write a function that indicates whether an input IPv4 address is valid. IP addresses are valid if they have 4 values between 0 and 255 (inclusive), separated by periods.

Input 1:

2.33.245.5

Output 1:

True

Input 2:

12.345.67.89

Output 2:

False

#### Thoughts on this one: split the address into fields on the periods. If there aren't exactly 4 fields after splitting, return False. Then check each field; if any field is not between 0 and 255, return False.

In [1]:
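def IPA(address):
    # Only a string can be split into fields
    if not isinstance(address, str):
        return False

    # Split the address into its period-separated fields
    ip_no = address.split('.')

    # A valid IPv4 address has exactly 4 fields
    if len(ip_no) != 4:
        return False

    for num in ip_no:
        # Each field must be plain ASCII digits; int() alone would also accept '+1' or ' 1'
        if not (num.isascii() and num.isdigit()):
            return False
        # Each value must fall between 0 and 255 (inclusive)
        if int(num) > 255:
            return False
    return True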

In [2]:
IPA('2.33.245.5')

True

In [3]:
IPA('12.345.67.89')

False

In [4]:
IPA('2.33.344.5')

False

In [5]:
IPA('2.244.5')

False
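
As a cross-check, the same inputs can be run through Python's standard-library `ipaddress` module. The wrapper name `is_valid_ipv4` is made up for this sketch. The module's rules may be stricter than the assignment's in some edge cases (for example, recent Python versions reject fields with leading zeros), so it is offered as a sanity check, not as a replacement for the function above.

```python
import ipaddress

def is_valid_ipv4(address):
    """Return True if `address` parses as an IPv4 address (hypothetical helper)."""
    try:
        ipaddress.IPv4Address(address)
        return True
    except ValueError:
        # AddressValueError is a subclass of ValueError
        return False

# Same test inputs as in the cells above
for candidate in ['2.33.245.5', '12.345.67.89', '2.33.344.5', '2.244.5']:
    print(candidate, is_valid_ipv4(candidate))
```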