# DAT210x - Programming with Python for DS

## Module4 - Lab2

In [50]:
import math
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib

from mpl_toolkits.mplot3d import Axes3D

from sklearn import preprocessing

In [51]:
# Look pretty...

# matplotlib.style.use('ggplot')
plt.style.use('ggplot')

### Some Boilerplate Code

For your convenience, we've included some boilerplate code here which will help you out. You aren't expected to know how to write this code on your own at this point, but it'll assist with your visualizations. We've added some notes to the code in case you're interested in knowing what it's doing:

### A Note on SKLearn's `.transform()` calls:

Any time you perform a transformation on your data, you lose the column header names because the output of SciKit-Learn's `.transform()` method is an NDArray and not a dataframe.

This actually makes a lot of sense because there are essentially two types of transformations:
- Those that adjust the scale of your features, and
- Those that alter the number of features, perhaps even changing their values entirely.

An example of adjusting the scale of a feature would be changing centimeters to inches. Changing the feature entirely would be like using PCA to reduce 300 columns to 30. In either case, the original columns' units have either been altered or no longer exist at all, so it's up to you to assign names to your columns after any transformation, if you'd like to store the resulting NDArray back into a dataframe.
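As a rough sketch of both cases, the snippet below uses a small made-up dataframe. The data, the column names `height_cm`, `weight_kg`, `age_years`, and the output names `component1`, `component2` are all assumptions for illustration, not part of the lab dataset:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Made-up data, for illustration only
demo = pd.DataFrame({
    'height_cm': [150.0, 160.0, 170.0, 180.0],
    'weight_kg': [50.0, 65.0, 62.0, 80.0],
    'age_years': [20.0, 35.0, 50.0, 41.0],
})

# Case 1: scaling keeps the number of features, so the old names still fit
# (only the units have changed). The transform itself returns a plain array.
scaled = StandardScaler().fit_transform(demo)
print(type(scaled).__name__, scaled.shape)
scaled_df = pd.DataFrame(scaled, columns=demo.columns)

# Case 2: PCA replaces the features entirely, so new names are needed
reduced = PCA(n_components=2).fit_transform(scaled)
reduced_df = pd.DataFrame(reduced, columns=['component1', 'component2'])
print(reduced_df.shape)
```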

In [52]:
def scaleFeaturesDF(df):
    # Feature scaling is a type of transformation that only changes the
    # scale, but not number of features. Because of this, we can still
    # use the original dataset's column names... so long as we keep in
    # mind that the _units_ have been altered:

    scaled = preprocessing.StandardScaler().fit_transform(df)
    scaled = pd.DataFrame(scaled, columns=df.columns)
    
    print("New Variances:\n", scaled.var())
    print("New Describe:\n", scaled.describe())
    return scaled

SKLearn contains many methods for transforming your features by scaling them (a type of pre-processing):
- `RobustScaler`
- `Normalizer`
- `MinMaxScaler`
- `MaxAbsScaler`
- `StandardScaler`
- ...

http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing

However, in order to be effective at PCA, there are a few requirements that must be met, and which will drive the selection of your scaler. PCA requires that your data be standardized: in other words, its _mean_ should equal 0, and it should have unit variance.

SKLearn's regular `Normalizer()` doesn't zero out the mean of your data; it rescales each sample (row) to unit norm, so it could be inappropriate to use depending on your data. `MinMaxScaler` and `MaxAbsScaler` both fail to set a unit variance, so you won't be using them here either. `RobustScaler` can work, again depending on your data (watch for outliers!). So for this assignment, you're going to use the `StandardScaler`. Get familiar with it by visiting these two websites:

- http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler
- http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler
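To get a feel for how these scalers differ, here is a minimal sketch that runs each one over a tiny made-up array (the values in `X` are assumptions, chosen only so the two features sit on very different scales) and prints the per-feature mean and variance afterwards:

```python
import numpy as np
from sklearn.preprocessing import (
    MaxAbsScaler, MinMaxScaler, Normalizer, RobustScaler, StandardScaler,
)

# Made-up data: two features on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0],
              [6.0, 6000.0]])

# Print the per-feature mean and (population) variance after each scaler
for scaler in (StandardScaler(), RobustScaler(), MinMaxScaler(),
               MaxAbsScaler(), Normalizer()):
    Z = scaler.fit_transform(X)
    print(f"{type(scaler).__name__:>14}  "
          f"mean={Z.mean(axis=0).round(3)}  var={Z.var(axis=0).round(3)}")

# Keep two results around for a closer look
standardized = StandardScaler().fit_transform(X)
minmax = MinMaxScaler().fit_transform(X)
```

Only the `StandardScaler` row should show a mean of 0 and a variance of 1 for every feature.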

Lastly, some code to help with visualizations:

In [53]:
def drawVectors(transformed_features, components_, columns, plt, scaled):
    if not scaled:
        return plt.axes() # No cheating ;-)

    num_columns = len(columns)

    # This function will project your *original* features (columns)
    # onto your principal component feature-space, so that you can
    # visualize how "important" each one was in the
    # multi-dimensional scaling

    # Scale the principal components by the max value in
    # the transformed set belonging to that component
    xvector = components_[0] * max(transformed_features[:,0])
    yvector = components_[1] * max(transformed_features[:,1])

    ## visualize projections

    # Sort each column by its length. These are your *original*
    # columns, not the principal components.
    important_features = { columns[i] : math.sqrt(xvector[i]**2 + yvector[i]**2) for i in range(num_columns) }
    important_features = sorted(zip(important_features.values(), important_features.keys()), reverse=True)
    print("Features by importance:\n", important_features)

    ax = plt.axes()

    for i in range(num_columns):
        # Use an arrow to project each original feature as a
        # labeled vector on your principal component axes
        plt.arrow(0, 0, xvector[i], yvector[i], color='b', width=0.0005, head_width=0.02, alpha=0.75)
        plt.text(xvector[i]*1.2, yvector[i]*1.2, list(columns)[i], color='b', alpha=0.75)

    return ax

### And Now, The Assignment

In [107]:
# Do * NOT * alter this line, until instructed!
scaleFeatures = True

Load up the dataset specified on the lab instructions page and remove any and all _rows_ that have a NaN in them. You should be a pro at this by now ;-)

**QUESTION**: Should the `id` column be included in your dataset as a feature?
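One thing to watch for when removing rows: by default, `.dropna()` returns a new dataframe and leaves the original untouched. A minimal sketch with made-up data (the frame `demo` and its columns are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up frame: only the first row is free of NaNs
demo = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})

demo.dropna(axis=0, how='any')            # result discarded; demo is unchanged
print(len(demo))

cleaned = demo.dropna(axis=0, how='any')  # keep the returned frame instead
print(len(cleaned))
```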

In [120]:
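df = pd.read_csv('./Datasets/kidney_disease.csv', index_col=0)

# Note: dropna() returns a new dataframe. The result is not assigned here,
# so this call leaves df itself unchanged.
df.dropna(axis=0, how='any')

# 'wc' and 'rc' are read in as text; coerce them to numbers (unparseable
# values become NaN). Plain column assignment lets the column dtype change.
df['wc'] = pd.to_numeric(df['wc'], errors='coerce')
df['rc'] = pd.to_numeric(df['rc'], errors='coerce')
df.tail(10)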

id,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,...,pcv,wc,rc,htn,dm,cad,appet,pe,ane,classification
390,52.0,80.0,1.025,0.0,0.0,normal,normal,notpresent,notpresent,99.0,...,52,6300.0,5.3,no,no,no,good,no,no,notckd
391,36.0,80.0,1.025,0.0,0.0,normal,normal,notpresent,notpresent,85.0,...,44,5800.0,6.3,no,no,no,good,no,no,notckd
392,57.0,80.0,1.02,0.0,0.0,normal,normal,notpresent,notpresent,133.0,...,46,6600.0,5.5,no,no,no,good,no,no,notckd
393,43.0,60.0,1.025,0.0,0.0,normal,normal,notpresent,notpresent,117.0,...,54,7400.0,5.4,no,no,no,good,no,no,notckd
394,50.0,80.0,1.02,0.0,0.0,normal,normal,notpresent,notpresent,137.0,...,45,9500.0,4.6,no,no,no,good,no,no,notckd
395,55.0,80.0,1.02,0.0,0.0,normal,normal,notpresent,notpresent,140.0,...,47,6700.0,4.9,no,no,no,good,no,no,notckd
396,42.0,70.0,1.025,0.0,0.0,normal,normal,notpresent,notpresent,75.0,...,54,7800.0,6.2,no,no,no,good,no,no,notckd
397,12.0,80.0,1.02,0.0,0.0,normal,normal,notpresent,notpresent,100.0,...,49,6600.0,5.4,no,no,no,good,no,no,notckd
398,17.0,60.0,1.025,0.0,0.0,normal,normal,notpresent,notpresent,114.0,...,51,7200.0,5.9,no,no,no,good,no,no,notckd
399,58.0,80.0,1.025,0.0,0.0,normal,normal,notpresent,notpresent,131.0,...,53,6800.0,6.1,no,no,no,good,no,no,notckd


Let's build some color-coded labels; the actual label feature will be removed prior to executing PCA, since it's unsupervised. You're only labeling by color so you can see the effects of PCA:

In [121]:
labels = ['red' if i=='ckd' else 'green' for i in df.classification]

Use an indexer to select only the following columns: `['bgr','wc','rc']`

In [123]:
columns = ['bgr', 'wc', 'rc']

# Select the three features from df, drop any row with a NaN in one of
# them, and renumber the index from 0.
df1 = df.loc[:, columns]
df1 = df1.dropna(axis=0, how='any')
df1 = df1.reset_index(drop=True)
df1.tail(20)

,bgr,wc,rc
231,113.0,6500.0,4.9
232,79.0,5800.0,5.9
233,75.0,6000.0,6.5
234,119.0,5100.0,5.0
235,132.0,11000.0,4.5
236,113.0,8000.0,5.1
237,100.0,5700.0,6.5
238,93.0,6200.0,5.2
239,94.0,9500.0,6.4
240,112.0,7200.0,5.8


Either take a look at the attribute info section of UCI's [Chronic Kidney Disease](https://archive.ics.uci.edu/ml/datasets/Chronic_Kidney_Disease) page, or alternatively, look at the first few rows of your dataframe using `.head()`. What data type should these three columns be? Compare what you see with the results when you print out your dataframe's `dtypes`.

If Pandas did not properly detect and convert your columns to the data types you expected, use an appropriate command to coerce these features to the right type.
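One common way to do this is `pd.to_numeric` with `errors='coerce'`, which turns anything it cannot parse into NaN. A small sketch on a made-up column (the values in `raw` are assumptions, not taken from the dataset):

```python
import pandas as pd

# Made-up column: numbers stored as text, with one unparseable entry
raw = pd.Series(['7800', '6700', '?', '9600'])
print(raw.dtype)

# Anything that cannot be parsed as a number becomes NaN
numeric = pd.to_numeric(raw, errors='coerce')
print(numeric.dtype)
print(numeric.isna().sum())
```

Because coercion can introduce new NaNs, it is worth checking for missing values again afterwards.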

In [111]:
df1.head(10)

,bgr,wc,rc
0,121.0,7800.0,5.2
1,117.0,6700.0,3.9
2,106.0,7300.0,4.6
3,74.0,7800.0,4.4
4,410.0,6900.0,5.0
5,138.0,9600.0,4.0
6,70.0,12100.0,3.7
7,380.0,4500.0,3.8
8,208.0,12200.0,3.4
9,157.0,11000.0,2.6


PCA operates based on variance: the variable with the greatest variance will dominate. Examine your data using a command that will check the variance of every feature in your dataset, and then print out the results. Also print out the results of running `.describe()` on your dataset.

_Hint:_ If you do not see all three variables (`'bgr'`, `'wc'`, and `'rc'`), then you probably did not complete the previous step properly.
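To see why the highest-variance variable dominates, the sketch below computes the first principal component by hand, as the top eigenvector of the covariance matrix, on made-up independent features with very different spreads. The data, the spreads, and the helper `first_component` are assumptions for illustration; this is not how scikit-learn computes PCA internally:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Made-up, independent features with very different standard deviations
X = np.column_stack([
    rng.normal(0.0, 77.0, n),
    rng.normal(0.0, 2958.0, n),
    rng.normal(0.0, 1.0, n),
])

def first_component(data):
    """Unit-length direction of greatest variance (top covariance eigenvector)."""
    cov = np.cov(data, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigenvalues ascending
    return eigenvectors[:, -1]

# Unscaled: the loadings are almost entirely on the second feature
raw_pc1 = first_component(X)
print(np.abs(raw_pc1).round(3))

# Standardized: no single feature wins just because of its units
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_pc1 = first_component(Z)
print(np.abs(scaled_pc1).round(3))
```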

In [96]:
df1.var()

bgr    5.917921e+03
wc     8.752551e+06
rc     1.037287e+00
dtype: float64

In [97]:
df1.describe()

,bgr,wc,rc
count,251.0,251.0,251.0
mean,143.40239,8529.083665,4.729482
std,76.928028,2958.471017,1.018473
min,70.0,3800.0,2.1
25%,99.0,6700.0,4.0
50%,118.0,8100.0,4.8
75%,142.0,9800.0,5.4
max,490.0,26400.0,8.0


In [124]:
# Magic command, works inside jupyter notebooks
# This includes an interactive control/renderer and does not require plt.show()
%matplotlib notebook

# Render the dataset with labels

fig = plt.figure()
ax  = fig.add_subplot(111, projection='3d')

ax.set_title('Stuff')
ax.set_xlabel('bgr')
ax.set_ylabel('wc')
ax.set_zlabel('rc')
ax.scatter(df1.bgr, df1.wc, df1.rc, c=labels, marker='.', alpha=0.75)

plt.show()


Below, we assume your dataframe's variable is named `df`. If it isn't, make the appropriate changes. But do not alter the code in `scaleFeaturesDF()` just yet!

In [125]:
# .. your (possible) code adjustment here ..
if scaleFeatures: df1 = scaleFeaturesDF(df1)
    
df1.var()

New Variances:
 bgr    1.004
wc     1.004
rc     1.004
dtype: float64
New Describe:
                 bgr            wc            rc
count  2.510000e+02  2.510000e+02  2.510000e+02
mean  -5.993435e-17  1.508311e-16  1.928515e-16
std    1.001998e+00  1.001998e+00  1.001998e+00
min   -9.560761e-01 -1.601683e+00 -2.586947e+00
25%   -5.783472e-01 -6.194883e-01 -7.176818e-01
50%   -3.308696e-01 -1.453254e-01  6.937722e-02
75%   -1.826633e-02  4.304438e-01  6.596715e-01
max    4.514481e+00  6.052661e+00  3.217613e+00


bgr    1.004
wc     1.004
rc     1.004
dtype: float64
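The variances above print as 1.004 rather than exactly 1. This is most likely a degrees-of-freedom mismatch: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `.var()` defaults to the sample estimate (`ddof=1`), so the reported variance is n/(n-1). With the 251 rows counted in the describe output, that is 251/250 = 1.004. A small sketch of the same effect, using made-up values and plain pandas arithmetic in place of the scaler:

```python
import pandas as pd

# Made-up values, for illustration only
values = pd.Series([70.0, 99.0, 118.0, 142.0, 490.0])

# Standardize with the population standard deviation (ddof=0)
z = (values - values.mean()) / values.std(ddof=0)
n = len(z)

print(z.var(ddof=0))  # population variance
print(z.var())        # pandas default (ddof=1), which equals n / (n - 1)

# The same ratio for the 251 rows in this lab
ratio = 251 / 250
print(ratio)
```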

Run PCA on your dataset, reducing it to 2 principal components. Make sure your PCA model is saved in a variable called `'pca'`, and that the results of your transformation are saved in another variable `'T'`:

In [127]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2, svd_solver='full')
pca.fit(df1)

T = pca.transform(df1)

Now, plot the transformed data as a scatter plot. Recall that transforming the data will result in a NumPy NDArray. You can either use MatPlotLib to graph it directly, or you can convert it back to a DataFrame and have Pandas do it for you.

Since we've already demonstrated how to plot directly with MatPlotLib in `Module4/assignment1.ipynb`, this time we'll show you how to convert your transformed data back into a Pandas DataFrame and have Pandas plot it from there.

In [128]:
# Since we transformed via PCA, we no longer have column names; but we know we
# are in `principal-component` space, so we'll just define the coordinates accordingly.
# Pass df1's columns (the three features PCA was fit on), not df's: each row of
# pca.components_ has one entry per fitted feature, so looping over all of df's
# columns indexes past its end.
ax = drawVectors(T, pca.components_, df1.columns.values, plt, scaleFeatures)
T  = pd.DataFrame(T)

T.columns = ['component1', 'component2']
T.plot.scatter(x='component1', y='component2', marker='o', c=labels, alpha=0.75, ax=ax)

plt.show()