In [3]:
import pandas as pd
from sklearn.decomposition import PCA

# Dimension reduction with PCA
In this exercise, we use the [census crime data](https://moodle.city.ac.uk/pluginfile.php/2782683/mod_page/content/5/censusCrimeClean.csv) that we've used before.

We will be doing some PCA on these data. PCA is about dimension reduction, so we will be creating synthetic variables (principal components) that try to capture most of the variation, and studying how much each original variable contributes to them. To remind you, each row (sample/record) is a "community" (small geographical region) with census data (and crime) recorded.

Using real and messy data often makes the interpretation more difficult. In this exercise, you will see first-hand an example of what happens if we include some invalid features.

1. Load the data (above) in a DataFrame.
2. Since PCA only works with numerical data, extract all the data except the first column (which is text) and store as a variable.
3. Create and fit a PCA with two components on these features.
4. Look at the `explained_variance_ratio_`. How much of the variation are these two components capturing? Try standardising the variables and see what effect that has. Standardising scales the variables so they all have the same variance. This is [usually recommended for PCA](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py), because it ensures that all variables are given the same weighting. But in our case, the variables are all percentages, so it may be less important. Comment on how much of the variation is captured by these two components.
5. [Transform](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.fit_transform) the samples into the principal components and plot them on a scatterplot. Each dot (sample) is a "community" (small geographical region). How does this look?
6. Look at the loadings. The `components_` attribute contains the loadings, which give the contribution of each feature to each principal component (listed in the same order as the columns). I suggest you:
    - put them in a pandas DataFrame with the original column headings as the headings
    - transpose so that the rows are the features and the columns are the components
    - convert the values to absolute values (remove the sign, we don't need it)
    - sort by the (absolute) loadings in the first column
    These will tell you which variables contribute most strongly. You will see that one massively dominates, and this may help you interpret the scatterplot. Have a look at that variable. Why should we not be using it?
7. Repeat with this variable removed.
8. As an extra - plot the samples in the principal component space (like before). Colour the points by `ViolentCrimesPerPop`. In which principal components does this vary? Look at the loadings - which variables seem to relate to this? Note that this is not a great way to analyse this, just a by-product of what we have done.
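
Steps 3 to 5 can be sketched roughly as follows. This is only an illustration on synthetic data, not the census file: the column names `a`, `b`, `c` and `big_scale` are made up, and `big_scale` is deliberately given a much larger range than the others so that the effect of standardising is visible.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the numeric features (hypothetical columns):
# three columns in [0, 1] and one column on a much larger scale.
X = pd.DataFrame({
    "a": rng.random(200),
    "b": rng.random(200),
    "c": rng.random(200),
    "big_scale": rng.integers(1, 11, size=200).astype(float),
})

# PCA on the raw values: high-variance columns dominate the components.
pca_raw = PCA(n_components=2).fit(X)
raw_ratio = pca_raw.explained_variance_ratio_

# PCA after standardising: every column has unit variance before fitting.
X_std = StandardScaler().fit_transform(X)
pca_std = PCA(n_components=2).fit(X_std)
std_ratio = pca_std.explained_variance_ratio_

# Coordinates of each sample in the two-component space (what you would plot).
scores = pca_std.transform(X_std)

print("raw:", raw_ratio.round(3))
print("standardised:", std_ratio.round(3))
```

In this toy setup the unstandardised fit should put nearly all of the variance on the first component, simply because one column has a much larger spread. After standardising, the variance is shared far more evenly.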

In [2]:
df = pd.read_csv("censusCrimeClean.csv", header = 0)
df

Unnamed: 0,communityname,fold,population,householdsize,racepctblack,racePctWhite,racePctAsian,racePctHisp,agePct12t21,agePct12t29,...,NumStreet,PctForeignBorn,PctBornSameState,PctSameHouse85,PctSameCity85,PctSameState85,LandArea,PopDens,PctUsePubTrans,ViolentCrimesPerPop
0,Lakewoodcity,1,0.19,0.33,0.02,0.90,0.12,0.17,0.34,0.47,...,0.00,0.12,0.42,0.50,0.51,0.64,0.12,0.26,0.20,0.20
1,Tukwilacity,1,0.00,0.16,0.12,0.74,0.45,0.07,0.26,0.59,...,0.00,0.21,0.50,0.34,0.60,0.52,0.02,0.12,0.45,0.67
2,Aberdeentown,1,0.00,0.42,0.49,0.56,0.17,0.04,0.39,0.47,...,0.00,0.14,0.49,0.54,0.67,0.56,0.01,0.21,0.02,0.43
3,Willingborotownship,1,0.04,0.77,1.00,0.08,0.12,0.10,0.51,0.50,...,0.00,0.19,0.30,0.73,0.64,0.65,0.02,0.39,0.28,0.12
4,Bethlehemtownship,1,0.01,0.55,0.02,0.95,0.09,0.05,0.38,0.38,...,0.00,0.11,0.72,0.64,0.61,0.53,0.04,0.09,0.02,0.03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1989,TempleTerracecity,10,0.01,0.40,0.10,0.87,0.12,0.16,0.43,0.51,...,0.00,0.22,0.28,0.34,0.48,0.39,0.01,0.28,0.05,0.09
1990,Seasidecity,10,0.05,0.96,0.46,0.28,0.83,0.32,0.69,0.86,...,0.00,0.53,0.25,0.17,0.10,0.00,0.02,0.37,0.20,0.45
1991,Waterburytown,10,0.16,0.37,0.25,0.69,0.04,0.25,0.35,0.50,...,0.02,0.25,0.68,0.61,0.79,0.76,0.08,0.32,0.18,0.23
1992,Walthamcity,10,0.08,0.51,0.06,0.87,0.22,0.10,0.58,0.74,...,0.01,0.45,0.64,0.54,0.59,0.52,0.03,0.38,0.33,0.19


In [4]:
features = df.drop(columns='communityname')  # drop the text column and keep the result for PCA
features

Unnamed: 0,fold,population,householdsize,racepctblack,racePctWhite,racePctAsian,racePctHisp,agePct12t21,agePct12t29,agePct16t24,...,NumStreet,PctForeignBorn,PctBornSameState,PctSameHouse85,PctSameCity85,PctSameState85,LandArea,PopDens,PctUsePubTrans,ViolentCrimesPerPop
0,1,0.19,0.33,0.02,0.90,0.12,0.17,0.34,0.47,0.29,...,0.00,0.12,0.42,0.50,0.51,0.64,0.12,0.26,0.20,0.20
1,1,0.00,0.16,0.12,0.74,0.45,0.07,0.26,0.59,0.35,...,0.00,0.21,0.50,0.34,0.60,0.52,0.02,0.12,0.45,0.67
2,1,0.00,0.42,0.49,0.56,0.17,0.04,0.39,0.47,0.28,...,0.00,0.14,0.49,0.54,0.67,0.56,0.01,0.21,0.02,0.43
3,1,0.04,0.77,1.00,0.08,0.12,0.10,0.51,0.50,0.34,...,0.00,0.19,0.30,0.73,0.64,0.65,0.02,0.39,0.28,0.12
4,1,0.01,0.55,0.02,0.95,0.09,0.05,0.38,0.38,0.23,...,0.00,0.11,0.72,0.64,0.61,0.53,0.04,0.09,0.02,0.03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1989,10,0.01,0.40,0.10,0.87,0.12,0.16,0.43,0.51,0.35,...,0.00,0.22,0.28,0.34,0.48,0.39,0.01,0.28,0.05,0.09
1990,10,0.05,0.96,0.46,0.28,0.83,0.32,0.69,0.86,0.73,...,0.00,0.53,0.25,0.17,0.10,0.00,0.02,0.37,0.20,0.45
1991,10,0.16,0.37,0.25,0.69,0.04,0.25,0.35,0.50,0.31,...,0.02,0.25,0.68,0.61,0.79,0.76,0.08,0.32,0.18,0.23
1992,10,0.08,0.51,0.06,0.87,0.22,0.10,0.58,0.74,0.63,...,0.01,0.45,0.64,0.54,0.59,0.52,0.03,0.38,0.33,0.19
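
With the text column dropped, the loadings table suggested in step 6 could be built along these lines. This is a minimal sketch on synthetic data, with hypothetical feature names `f1` to `f4`, where `f4` is scaled up on purpose so that it dominates the first component.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Synthetic features (hypothetical names); f4 is given a much larger scale.
X = pd.DataFrame(rng.random((100, 4)), columns=["f1", "f2", "f3", "f4"])
X["f4"] = X["f4"] * 10

pca = PCA(n_components=2).fit(X)

# components_ has one row per component and one column per feature.
loadings = (
    pd.DataFrame(pca.components_, columns=X.columns, index=["PC1", "PC2"])
    .T                                    # rows = features, columns = components
    .abs()                                # drop the sign
    .sort_values("PC1", ascending=False)  # strongest contributors to PC1 first
)
print(loadings)
```

The same chain should work on the real features by swapping `X` for the DataFrame with `communityname` removed.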
