# PCA

PCA stands for principal component analysis. It computes the principal components of a dataset, usually with the goal of reducing its dimensionality. Here we use it to identify which independent variables account for the largest share of the variance and are therefore the most prominent ones in the dataset.
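
As a minimal sketch of the idea, the snippet below runs sklearn's `PCA` on synthetic data (not our dataset). The three toy features and the random seed are assumptions made only for illustration: the second feature is almost a copy of the first, so two components are enough to capture nearly all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 3 features.
# Feature 1 is feature 0 plus a little noise, so it adds almost no new information.
x0 = rng.normal(size=200)
x1 = x0 + 0.05 * rng.normal(size=200)
x2 = rng.normal(size=200)
X = np.column_stack([x0, x1, x2])

# Keep two principal components (PCA centers the data internally).
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (200, 2)
print(pca.explained_variance_ratio_)  # share of variance captured by each component
```

The `explained_variance_ratio_` attribute is the same quantity printed by the PCA function later in this notebook.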

In [7]:
import math
import pandas as pd # data processing
import numpy as np # working with arrays
import matplotlib.pyplot as plt # visualization
import seaborn as sb # visualization

from sklearn import preprocessing
from sklearn.model_selection import train_test_split # data split
from sklearn import decomposition
from sklearn import datasets

## Advanced Data Processing for PCA

We want to do some additional data processing before we run PCA. The additional steps are:

1. Change the data type of several columns that hold numerical data stored as strings.

2. Split the data by month before running PCA. Because our dataset has both spatial and temporal dimensions, we split it along the temporal dimension and run PCA on each subset.

3. Normalize the data. Our independent variables have different units and scales, so we normalize them to bring them onto a comparable scale.

4. Convert the dataframe to a numpy array for the PCA function.

In [2]:
# Update data type of a few columns
cur_data = pd.read_csv(r'../data_collection_clean/df_after_FINAL.csv')
# These columns hold numbers stored as strings with comma separators; strip the commas, then convert
for column in ["Area", "GDP", "Population_Density", "Violent_Crimes", "Property_Crime"]:
    cur_data[column] = pd.to_numeric(cur_data[column].str.replace(',', ''))
cur_data['fully_vaccinated_rate'] = pd.to_numeric(cur_data["fully_vaccinated_rate"])
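
The conversion above can be tried on a small frame. This is a sketch on made-up values; the `toy` frame and its contents are hypothetical and only reuse two column names from our data.

```python
import pandas as pd

# Hypothetical frame: numbers stored as strings with comma thousands separators.
toy = pd.DataFrame({
    "Area": ["1,234", "56", "7,890,123"],
    "GDP": ["10,000", "2,500", "300"],
})

for col in ["Area", "GDP"]:
    # regex=False treats "," as a literal character rather than a pattern
    toy[col] = pd.to_numeric(toy[col].str.replace(",", "", regex=False))

print(toy.dtypes)
print(toy)
```

`pd.read_csv` also accepts a `thousands` argument, which may let the separators be handled at load time instead.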

In [3]:
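# Split data by Month
month_dataset_dict = dict()
# Iterate over the distinct months only; looping over every row rebuilds the same subset repeatedly
for month in cur_data['Date'].unique():
    month_dataset_dict[month] = cur_data.loc[cur_data['Date'] == month].dropna(axis='columns')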
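
An equivalent way to build the per-month subsets is `groupby`. The sketch below uses a hypothetical four-row frame to show what `dropna(axis='columns')` does in each subset: any column with a missing value in that month is dropped for that month only.

```python
import pandas as pd

# Hypothetical data: vaccination counts are missing in the early month.
toy = pd.DataFrame({
    "Date": ["2020-02", "2020-02", "2020-08", "2020-08"],
    "RegionName": ["A", "B", "A", "B"],
    "death_rate": [0.1, 0.3, 0.2, 0.4],
    "fully_vaccinated": [None, None, 10.0, 20.0],
})

# One sub-frame per month; columns containing any NaN within that month are removed.
by_month = {
    month: group.dropna(axis="columns")
    for month, group in toy.groupby("Date")
}

print(list(by_month))
print(by_month["2020-02"].columns.tolist())
print(by_month["2020-08"].columns.tolist())
```

One consequence is that different months can end up with different column sets, so a feature list built from one month is not guaranteed to exist in another.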

In [4]:
# Apply normalization, making each value in a column a fraction of the sum of the column
# (identifier columns, Price and the two vaccination columns are left unchanged)
for key, value in month_dataset_dict.items():
    for column in value.columns:
        if column not in {'Unnamed: 0', 'RegionName', 'State', 'Date', 'Price', 'fully_vaccinated', 'fully_vaccinated_rate'}:
            value[column] = value[column] / value[column].sum()
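
Dividing by the column sum makes every column sum to 1, but it does not by itself give the columns equal variance. The sketch below compares it with z-score standardization (`StandardScaler`), which is a different technique from the one used in this notebook and is shown only for contrast. The toy values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values for two columns with very different units.
toy = pd.DataFrame({
    "Population": [1000.0, 2000.0, 7000.0],
    "UnemploymentRate": [3.0, 4.0, 5.0],
})

# Sum-normalization, as in the cell above: each column becomes a share of its total.
share = toy / toy.sum()

# Z-score standardization: each column gets zero mean and unit (population) variance.
zscore = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

print(share.sum())          # every column sums to 1
print(share.var())          # variances still differ between columns
print(zscore.var(ddof=0))   # variances are all 1
```

Since PCA is driven by variance, the choice between the two affects which features end up dominating the components.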

## PCA Function

In this part, we use the PCA implementation in sklearn to find which variables contribute the most variance. We can adjust the number of components to keep, as well as the set of features to test.

In [10]:
# PCA Functions
def run_pca_and_display(dataset, features, number_of_components):
    """Fit PCA on the selected features, display the loadings, and return the fitted model."""
    independent_variables = dataset[features].to_numpy()
    pca = decomposition.PCA(n_components=number_of_components)
    pca.fit(independent_variables)
    print("explained variance ratio (first %s components): %s" % (str(number_of_components), str(pca.explained_variance_ratio_)))
    # One row per principal component, one column per feature (the loadings)
    display(pd.DataFrame(pca.components_, columns=features, index=['PC- %s' % str(i + 1) for i in range(number_of_components)]))
    n_pcs = pca.components_.shape[0]
    # For each component, index of the feature with the largest absolute loading
    most_important = [np.abs(pca.components_[i]).argmax() for i in range(n_pcs)]
    # get the names
    most_important_names = [features[most_important[i]] for i in range(n_pcs)]
    # map each component label to its most important feature
    dic = {'PC{}'.format(i + 1): most_important_names[i] for i in range(n_pcs)}
    # build the dataframe
    df = pd.DataFrame(sorted(dic.items()))
    display(df)
    return pca, dataset[most_important_names].to_numpy()
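
The "most important feature" step in the function above picks, for each component, the feature with the largest absolute loading in `pca.components_`. The sketch below isolates that step on synthetic data. The feature names `a`, `b`, `c` and their spreads are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
features = ["a", "b", "c"]

# Three independent synthetic features with standard deviations 1, 10 and 0.1.
X = np.column_stack([
    rng.normal(scale=1.0, size=300),
    rng.normal(scale=10.0, size=300),
    rng.normal(scale=0.1, size=300),
])

pca = PCA(n_components=2).fit(X)

# components_ has one row per component and one column per feature.
# argmax over axis=1 gives the feature with the largest absolute loading per component.
top = [features[i] for i in np.abs(pca.components_).argmax(axis=1)]
print(top)  # ['b', 'a']
```

This is a simple heuristic: it reports only the single largest loading, so a component that spreads its weight over several features is summarized by just one of them.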

## PCA with all features

In this part, we run PCA on all available features at 6-month intervals. We found that death rate and the number of fully vaccinated people are the predominant components in our dataset. For 2020-02, we believe this is sensible because some counties had started to accumulate deaths while others had not. Therefore, we see much more variance in death_rate than in the other variables. As time passes, fully_vaccinated becomes the predominant factor, which we attribute to the large variation in vaccination across different states.

In [11]:
# Initial PCA
initial_features = []
for column in month_dataset_dict['2020-02'].columns:    
    if column not in {'Unnamed: 0', 'RegionName', 'State', 'Date', 'Price'}:
        initial_features.append(column)
print(initial_features)
pca, X = run_pca_and_display(month_dataset_dict['2020-02'], initial_features, 3)
pca, X = run_pca_and_display(month_dataset_dict['2020-08'], initial_features, 3)
pca, X = run_pca_and_display(month_dataset_dict['2021-02'], initial_features, 3)
pca, X = run_pca_and_display(month_dataset_dict['2021-08'], initial_features, 3)

['Housing Inventory', 'UnemploymentRate', 'cases', 'deaths', 'cases_rate', 'death_rate', 'fully_vaccinated', 'fully_vaccinated_rate', 'inventory_price_increased', 'inventory_price_decreased', 'median_days_on_market', 'Population', 'Area', 'GDP', 'GDP_pp', 'Population_Density', 'Violent_Crimes', 'Violent_Crimes_pp', 'Property_Crime', 'Property_Crimes_pp', 'Revenue_pp', 'Expenditures_pp', 'Hospital', 'Hospital_pp', 'School', 'School_pp', 'Public_School', 'Public_School_pp', 'Private_School', 'Private_School_pp']
explained variance ratio (first 3 components): [0.51605116 0.30352252 0.08796664]


| | Housing Inventory | UnemploymentRate | cases | deaths | cases_rate | death_rate | fully_vaccinated | fully_vaccinated_rate | inventory_price_increased | inventory_price_decreased | ... | Revenue_pp | Expenditures_pp | Hospital | Hospital_pp | School | School_pp | Public_School | Public_School_pp | Private_School | Private_School_pp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PC- 1 | 0.087398 | -0.01616 | 0.198727 | 0.632066 | 0.056686 | 0.642161 | 8.470329e-22 | -6.617445e-24 | 0.090636 | 0.078167 | ... | -0.006776 | -0.006917 | 0.098289 | -0.026742 | 0.108897 | -0.016125 | 0.101416 | -0.019404 | 0.131911 | 0.001588 |
| PC- 2 | 0.214479 | -0.01508 | 0.212477 | -0.288271 | -0.002876 | -0.296969 | 3.388132e-21 | -0.0 | 0.233501 | 0.197646 | ... | -0.018293 | -0.018148 | 0.262096 | -0.038668 | 0.257273 | -0.023795 | 0.241428 | -0.028023 | 0.306025 | -0.000955 |
| PC- 3 | -0.111467 | -0.032446 | 0.013224 | -0.003801 | 0.079247 | 0.008369 | 0.0 | -5.421011e-20 | -0.127184 | -0.1232 | ... | -0.006636 | -0.004305 | -0.041101 | -0.020994 | -0.073073 | -0.023396 | -0.083684 | -0.032116 | -0.040428 | 0.023725 |


| | 0 | 1 |
|---|---|---|
| 0 | PC1 | death_rate |
| 1 | PC2 | Violent_Crimes |
| 2 | PC3 | Population_Density |


explained variance ratio (first 3 components): [9.98641182e-01 7.33799973e-04 3.25830737e-04]


| | Housing Inventory | UnemploymentRate | cases | deaths | cases_rate | death_rate | fully_vaccinated | fully_vaccinated_rate | inventory_price_increased | inventory_price_decreased | ... | Revenue_pp | Expenditures_pp | Hospital | Hospital_pp | School | School_pp | Public_School | Public_School_pp | Private_School | Private_School_pp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PC- 1 | 0.007136 | 0.000296 | 0.007716 | 0.007747 | -0.000155 | -0.00024 | 0.999604 | 3.811727e-07 | 0.007779 | 0.007273 | ... | -0.000523 | -0.000535 | 0.006969 | -0.001377 | 0.007477 | -0.000863 | 0.007043 | -0.001019 | 0.008816 | -2.3e-05 |
| PC- 2 | 0.228975 | 0.016104 | 0.313671 | 0.356189 | 0.017559 | 0.024486 | -0.027391 | -8.337001e-06 | 0.186014 | 0.197987 | ... | -0.014579 | -0.014026 | 0.258754 | -0.033321 | 0.242271 | -0.020339 | 0.225633 | -0.024274 | 0.293459 | 0.000914 |
| PC- 3 | -0.04493 | -0.009598 | -0.122397 | -0.157918 | -0.021722 | -0.058189 | 0.00379 | 3.50308e-06 | -0.104507 | -0.012522 | ... | -0.008106 | -0.005896 | -0.051203 | -0.024113 | -0.077724 | -0.024735 | -0.087292 | -0.034024 | -0.048287 | 0.025463 |


| | 0 | 1 |
|---|---|---|
| 0 | PC1 | fully_vaccinated |
| 1 | PC2 | deaths |
| 2 | PC3 | Population_Density |


explained variance ratio (first 3 components): [1.00000000e+00 3.38280779e-13 1.23207260e-13]


| | Housing Inventory | UnemploymentRate | cases | deaths | cases_rate | death_rate | fully_vaccinated | fully_vaccinated_rate | inventory_price_increased | inventory_price_decreased | ... | Revenue_pp | Expenditures_pp | Hospital | Hospital_pp | School | School_pp | Public_School | Public_School_pp | Private_School | Private_School_pp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PC- 1 | 3.405754e-07 | -1.132825e-09 | 3.661185e-07 | 5.089694e-07 | 1.933861e-10 | 5.566912e-08 | 1.0 | 4.310539e-09 | 3.10768e-07 | 3.376444e-07 | ... | -1.735209e-08 | -1.746931e-08 | 3.096909e-07 | -4.36947e-08 | 3.154503e-07 | -2.788678e-08 | 2.946714e-07 | -3.325363e-08 | 3.794169e-07 | 8.81448e-10 |
| PC- 2 | -0.001747675 | -0.01875806 | -0.04543607 | -0.07584649 | -0.01975111 | -0.008281373 | -1.068188e-07 | 0.03320714 | -0.072411 | 0.002033637 | ... | -0.01047955 | -0.008245405 | 0.01033921 | -0.03210196 | -0.02097052 | -0.02952106 | -0.0330393 | -0.03893134 | 0.01618252 | 0.02093163 |
| PC- 3 | -0.0619207 | -0.04172902 | -0.08459673 | -0.07768123 | -0.03754233 | -0.1835804 | 3.569662e-07 | 0.7850115 | -0.1657604 | -0.04228648 | ... | 0.2482684 | 0.2483475 | -0.06436731 | 0.2645985 | -0.06927375 | -0.0411589 | -0.08779118 | -0.04826355 | -0.01226892 | -0.00307137 |


| | 0 | 1 |
|---|---|---|
| 0 | PC1 | fully_vaccinated |
| 1 | PC2 | Population_Density |
| 2 | PC3 | fully_vaccinated_rate |


explained variance ratio (first 3 components): [1.00000000e+00 1.83242030e-12 3.61549498e-13]


| | Housing Inventory | UnemploymentRate | cases | deaths | cases_rate | death_rate | fully_vaccinated | fully_vaccinated_rate | inventory_price_increased | inventory_price_decreased | ... | Revenue_pp | Expenditures_pp | Hospital | Hospital_pp | School | School_pp | Public_School | Public_School_pp | Private_School | Private_School_pp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PC- 1 | 6.902414e-07 | 1.210425e-08 | 7.491985e-07 | 7.788604e-07 | -3.364273e-08 | -3.460075e-08 | 1.0 | 4.562679e-08 | 8.282593e-07 | 6.260642e-07 | ... | -3.973039e-08 | -4.003875e-08 | 7.12519e-07 | -1.007246e-07 | 7.262366e-07 | -6.366358e-08 | 6.831289e-07 | -7.432591e-08 | 8.58941e-07 | -6.511958e-09 |
| PC- 2 | 0.006448337 | -0.01615335 | -0.00656098 | -0.01423175 | -0.02772214 | -0.0368065 | -5.226913e-07 | -0.01497353 | -0.03385248 | -0.0135235 | ... | -0.01294931 | -0.01080584 | 0.0435924 | -0.03613523 | 0.01359947 | -0.03031084 | -0.0009056546 | -0.04005992 | 0.05825265 | 0.02195743 |
| PC- 3 | -0.05319986 | -0.03525418 | -0.06358219 | -0.068264 | -0.09215843 | -0.2308232 | 3.446996e-07 | -0.1338741 | -0.09258198 | -0.10864 | ... | 0.2739591 | 0.2725979 | 0.01294897 | 0.8039294 | -0.01670271 | 0.144471 | -0.03876563 | 0.1658747 | 0.05121672 | 0.02972081 |


| | 0 | 1 |
|---|---|---|
| 0 | PC1 | fully_vaccinated |
| 1 | PC2 | Population_Density |
| 2 | PC3 | Hospital_pp |
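
In the normalization cell, `fully_vaccinated` and `fully_vaccinated_rate` are excluded from the column-sum scaling, so they stay on their raw scale. That may explain part of why PC1 reaches an explained variance ratio of nearly 1.0 from 2020-08 onward. The sketch below reproduces the effect on synthetic data; the column values are random and are not taken from our dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
n = 100

# Three synthetic columns scaled to sum to 1, like the normalized features.
shares = rng.random((n, 3))
shares = shares / shares.sum(axis=0)

# One synthetic column left as raw counts, like an un-normalized feature.
raw_counts = rng.integers(1_000, 1_000_000, size=n).astype(float)

# Case 1: mix normalized columns with the raw-count column.
X_mixed = np.column_stack([shares, raw_counts])
pca_mixed = PCA(n_components=2).fit(X_mixed)
print(pca_mixed.explained_variance_ratio_[0])        # very close to 1
print(np.abs(pca_mixed.components_[0]).argmax())     # 3, the raw-count column

# Case 2: put the raw-count column on the same scale as the others.
X_same = np.column_stack([shares, raw_counts / raw_counts.sum()])
pca_same = PCA(n_components=2).fit(X_same)
print(pca_same.explained_variance_ratio_[0])         # no longer close to 1
```

If this is what happens in our data, the dominance of `fully_vaccinated` reflects its units as much as its variation across regions.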


## PCA without COVID Features

In the previous section, we saw that COVID-related factors accounted for most of the variance, but it is not convincing that the number of deaths is the primary driver of housing prices. Therefore, we explore the independent variables excluding the COVID factors.

Running PCA on the same time points, we found that private school, population and hospital per person are the primary factors. This is in line with the intuition that people tend to look for housing close to good schools and more hospitals. Meanwhile, more people usually means higher housing prices. For example, housing in NYC and SF is in general more expensive.

Across different time points, we found that the three components account for roughly 70, 18 and 3 percent of the variance, respectively. These numbers are consistent throughout the past 2 years.
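
To read these ratios as a whole, it can help to accumulate them. The sketch below uses the approximate figures quoted above (rounded, not the exact notebook output). The 0.90 threshold is an arbitrary value chosen for illustration.

```python
import numpy as np

# Approximate explained variance ratios quoted in the text above.
ratios = np.array([0.70, 0.18, 0.03])

# Running total of variance explained by the first k components.
cumulative = np.cumsum(ratios)
print(cumulative)  # approximately [0.70 0.88 0.91]

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.90  # illustrative choice
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print(n_needed)  # 3
```

With a fitted model, the same computation can be applied to `pca.explained_variance_ratio_` directly.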

In [None]:
no_covid_features = []
for column in month_dataset_dict['2020-02'].columns:    
    if column not in {'Unnamed: 0', 'RegionName', 'State', 'Date', 'Price','cases', 'deaths', 'cases_rate', 'death_rate', 'fully_vaccinated', 'fully_vaccinated_rate'}:
        no_covid_features.append(column)
print(no_covid_features)
pca, X = run_pca_and_display(month_dataset_dict['2020-02'], no_covid_features, 3)
pca, X = run_pca_and_display(month_dataset_dict['2020-08'], no_covid_features, 3)
pca, X = run_pca_and_display(month_dataset_dict['2021-02'], no_covid_features, 3)
pca, X = run_pca_and_display(month_dataset_dict['2021-08'], no_covid_features, 3)

## Per Capita PCA with all features

A variable may be strongly correlated with its per-capita version. In this section, we keep only the per-capita version of the variables and see which variable becomes predominant.

We have to run more dates because no single factor dominates at all time points. Nonetheless, we found that death rate and hospital per person are the major components. However, the principal components are not as convincing as before, given that the top explained variance ratio is only around 40 to 60 percent, except for 2020-02.

In [None]:
per_capita_pca = []
for column in month_dataset_dict['2020-02'].columns:    
    if column not in {'Unnamed: 0', 'RegionName', 'State', 'Date', 'Price','Housing Inventory', 'UnemploymentRate', 'cases', 'deaths', 'fully_vaccinated','inventory_price_increased', 'inventory_price_decreased', 'median_days_on_market', 'Population', 'Area', 'GDP', 'Population_Density', 
                      'Violent_Crimes', 'Property_Crime', 'Hospital', 'School', 'Public_School','Private_School',}:
        per_capita_pca.append(column)
print(per_capita_pca)
pca, X = run_pca_and_display(month_dataset_dict['2020-02'], per_capita_pca, 3)
pca, X = run_pca_and_display(month_dataset_dict['2020-06'], per_capita_pca, 3)
pca, X = run_pca_and_display(month_dataset_dict['2020-08'], per_capita_pca, 3)
pca, X = run_pca_and_display(month_dataset_dict['2020-11'], per_capita_pca, 3)
pca, X = run_pca_and_display(month_dataset_dict['2021-02'], per_capita_pca, 3)
pca, X = run_pca_and_display(month_dataset_dict['2021-06'], per_capita_pca, 3)
pca, X = run_pca_and_display(month_dataset_dict['2021-08'], per_capita_pca, 3)
pca, X = run_pca_and_display(month_dataset_dict['2021-11'], per_capita_pca, 3)


## Per Capita PCA without COVID features

Similarly, we want to explore which factor influences the housing price most if not for COVID. Therefore, we remove the COVID variables here.

We ran 8 different months and found that hospital per person, expenditures per person and public school per person are the major factors. The explained variance ratios are around 47, 25 and 13 percent.

We understand that we are missing more granular (monthly) data on variables like the number of schools or monthly expenditure. Therefore, the results across different months look similar to each other. However, this is still in line with our understanding of the housing market. As a potential follow-up, we could include more granular data on these fields to explore the influence of these factors across time.

In [None]:
per_capita_pca_no_covid = []
for column in month_dataset_dict['2020-02'].columns:    
    if column not in {'Unnamed: 0', 'RegionName', 'State', 'Date', 'Price','Housing Inventory', 'UnemploymentRate', 'cases', 'deaths', 'fully_vaccinated','inventory_price_increased', 'inventory_price_decreased', 'median_days_on_market', 'Population', 'Area', 'GDP', 'Population_Density', 
                      'Violent_Crimes', 'Property_Crime', 'Hospital', 'School', 'Public_School','Private_School','cases', 'deaths', 'cases_rate', 'death_rate', 'fully_vaccinated', 'fully_vaccinated_rate'}:
        per_capita_pca_no_covid.append(column)
print(per_capita_pca_no_covid)
pca, X = run_pca_and_display(month_dataset_dict['2020-02'], per_capita_pca_no_covid, 3)
pca, X = run_pca_and_display(month_dataset_dict['2020-06'], per_capita_pca_no_covid, 3)
pca, X = run_pca_and_display(month_dataset_dict['2020-08'], per_capita_pca_no_covid, 3)
pca, X = run_pca_and_display(month_dataset_dict['2020-11'], per_capita_pca_no_covid, 3)
pca, X = run_pca_and_display(month_dataset_dict['2021-02'], per_capita_pca_no_covid, 3)
pca, X = run_pca_and_display(month_dataset_dict['2021-06'], per_capita_pca_no_covid, 3)
pca, X = run_pca_and_display(month_dataset_dict['2021-08'], per_capita_pca_no_covid, 3)
pca, X = run_pca_and_display(month_dataset_dict['2021-11'], per_capita_pca_no_covid, 3)


## Conclusion

We explored a few different scenarios with PCA. We found that among the COVID-related features, death rate has the largest variance, but this is biased towards the beginning of our data. We also found that hospital shows a lot of variance regardless of whether COVID features are included.

If we exclude all COVID features, private school is a principal factor in our dataset followed by population density and hospital per person. If we only consider per-capita factors, hospital, expenditures and public school are more significant.

In conclusion, in the supervised learning part, we could pay attention to death rate, hospital per person, school and expenditure when selecting independent variables. We also believe that more granular data over time would be helpful in exploring PCA across time.