-----------------------------
## Context:
-----------------------------
In this case study, we will use the Air Pollution dataset, which contains 13 months of data on the levels of major pollutants and on meteorological conditions in a city.

-----------------------------
## Objective: 
-----------------------------
The objective of this problem is to reduce the number of features using a dimensionality reduction technique such as PCA, and to extract insights from the resulting components.

## Importing libraries and overview of the dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#to scale the data using z-score 
from sklearn.preprocessing import StandardScaler

#Importing PCA
from sklearn.decomposition import PCA

#### Loading data

In [None]:
#Loading data
data= pd.read_csv("Air_Pollution.csv")

In [None]:
data.head()

#### Check the info of the data

In [None]:
data.info()

- There are 403 observations and 27 columns in the data.
- All the columns except Date and Weather are of numeric data type.
- The Date and SrNo for all observations would be unique. We can drop these columns as they would not add value to our analysis.
- Weather is of object data type. We can create dummy variables for each category and convert it to numeric data type.
- The majority of the columns have some missing values.
- Let's check the number of missing values in each column.

In [None]:
data.isnull().sum()

- All the columns except SrNo and Date have missing values.

#### Data Preprocessing

In [None]:
data.drop(columns=["SrNo", "Date"], inplace=True)

In [None]:
#Imputing missing values with mode (most frequent) for the Weather column and with median for all other columns
#Assigning the result back avoids chained in-place fillna, which is deprecated in recent pandas versions
for col in data.columns:
    if col == "Weather":
        data[col] = data[col].fillna(data[col].mode()[0])
    else:
        data[col] = data[col].fillna(data[col].median())
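The same imputation rule can be checked on a small example. The sketch below uses a hypothetical toy frame (the column names and values are illustrative, not taken from `Air_Pollution.csv`) and fills numeric columns with the median and the categorical column with the mode.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; names and values are assumptions for illustration
toy = pd.DataFrame({
    "PM2.5": [10.0, np.nan, 30.0, 200.0],
    "Weather": ["Clear", None, "Clear", "Rain"],
})

# Median for numeric columns (robust to the outlier 200.0)
num_cols = toy.select_dtypes(include="number").columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())

# Mode (most frequent category) for the categorical column
toy["Weather"] = toy["Weather"].fillna(toy["Weather"].mode()[0])

print(toy)
```

The median is used rather than the mean because a single extreme reading would pull the mean toward it, while the median stays at a typical value.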

In [None]:
#Creating dummy variables for Weather column
data = pd.get_dummies(data, drop_first=True)
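To see what `drop_first=True` does, here is a minimal sketch on a hypothetical three-row frame (the categories are made up for illustration). With k categories, only k-1 dummy columns are kept, and the dropped category is represented by all dummies being zero.

```python
import pandas as pd

# Hypothetical toy frame; the Weather categories are assumptions
toy = pd.DataFrame({
    "Temp": [20.0, 25.0, 18.0],
    "Weather": ["Clear", "Rain", "Fog"],
})

# Categories are sorted alphabetically and the first one ("Clear") is dropped
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
```

Dropping one level avoids perfectly collinear dummy columns, which would otherwise add a redundant direction to the data before PCA.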

In [None]:
data.head()

#### Scaling the data

### Question 1: Define the standard scaler and apply it to the data to obtain data_scaled

In [None]:
scaler = ________
data_scaled = __________

In [None]:
data_scaled = pd.DataFrame(data_scaled, columns=data.columns)
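As a reference for what z-score scaling computes, the sketch below works on a small made-up array (not the notebook's data) and compares a manual calculation with scikit-learn's `StandardScaler`. Note that the scaler uses the population standard deviation (`ddof=0`), which is NumPy's default but not pandas' default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up array with two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 600.0]])

# Manual z-score: subtract the column mean, divide by the column std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same transformation through scikit-learn
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))
```

Scaling matters for PCA because the components follow the directions of largest variance, so an unscaled feature with large units would dominate them.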

## Principal Component Analysis

### Question 2: Define PCA with n components and random_state=1, and fit it to the scaled data.

In [None]:
#Defining the number of principal components to generate 
n = data_scaled.shape[1]

#Finding principal components for the data
pca1 = PCA(__________)
data_pca = pd.DataFrame(pca1._________________)

#The percentage of variance explained by each principal component
exp_var1 = pca1.explained_variance_ratio_
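To illustrate how `explained_variance_ratio_` behaves, here is a minimal sketch on synthetic data (the array, its shape, and the variable names are assumptions, unrelated to the pollution dataset). Two of the four synthetic features are made almost identical, so the first component should absorb roughly half of the total variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 rows, 4 features, with feature 1 nearly a copy of feature 0
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)

# Standardize so every feature contributes equal total variance
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Keep as many components as there are features
pca = PCA(n_components=X.shape[1], random_state=1)
scores = pca.fit_transform(X)

# Ratios are sorted in decreasing order and sum to 1 when all components are kept
ratios = pca.explained_variance_ratio_
print(np.round(ratios, 3))
```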

In [None]:
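# Visualize the cumulative explained variance against the number of components
plt.figure(figsize = (10,10))
# Derive the x-axis from the number of fitted components instead of hard-coding it
plt.plot(range(1, len(exp_var1) + 1), exp_var1.cumsum(), marker = 'o', linestyle = '--')
plt.title("Cumulative Explained Variance by Number of Components")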
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")

### Question 3: How many principal components are needed to explain more than 70% of the variance in the dataset?

In [None]:
# Find the least number of components that can explain more than 70% variance
cum_var = 0  # running total; named cum_var to avoid shadowing the built-in sum()
for ix, i in enumerate(exp_var1):
  cum_var = cum_var + i
  if(cum_var>________):
    print("Number of PCs that explain more than 70% variance: ", ix+1)
    break
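The same search can be written without an explicit loop. The sketch below uses hypothetical variance ratios (not the values from this dataset) and an assumed threshold variable to show the cumulative-sum approach.

```python
import numpy as np

# Hypothetical explained-variance ratios, for illustration only
ratios = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
threshold = 0.70  # assumed cut-off, matching the 70% in the question

# Running total of variance explained by the first k components
cumulative = np.cumsum(ratios)

# argmax returns the index of the first True; add 1 to turn it into a count
n_components = int(np.argmax(cumulative > threshold)) + 1

print(n_components)
```

One caveat: `np.argmax` returns 0 when no entry exceeds the threshold, so this shortcut is only safe when the threshold is below the total explained variance. scikit-learn also accepts a fraction between 0 and 1 as `n_components`, in which case it selects the number of components needed to exceed that share of variance.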

In [None]:
#Making a new dataframe with the first 5 principal components as columns and the original features as the index
cols = ['PC1', 'PC2', 'PC3', 'PC4', 'PC5']

pc1 = pd.DataFrame(np.round(pca1.components_.T[:, 0:5],2), index=data_scaled.columns, columns=cols)

### Question 4: Interpret the coefficients of the first five principal components from the dataframe below.

In [None]:
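def color_high(val):
    # Highlight strong negative loadings in pink and strong positive loadings in sky blue
    if val <= -0.25: # you can decide any value as per your understanding
        return 'background: pink'
    elif val >= 0.25:
        return 'background: skyblue'
    return ''  # no styling for loadings between the two cut-offs

# Note: in pandas >= 2.1, Styler.map is the preferred name for Styler.applymap
pc1.style.applymap(color_high)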
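One way to read a loadings table like `pc1` is to list, for each component, the features whose absolute coefficient reaches the chosen cut-off. The sketch below does this on a hypothetical loadings frame (feature names and values are made up) with the same 0.25 cut-off used for the highlighting.

```python
import pandas as pd

# Hypothetical loadings; rows are features, columns are components
loadings = pd.DataFrame(
    {"PC1": [0.45, 0.40, -0.05], "PC2": [0.10, -0.30, 0.60]},
    index=["feature_a", "feature_b", "feature_c"],
)
cutoff = 0.25  # assumed cut-off, same as in the styling function

# For each component, keep the features with |loading| >= cutoff
dominant = {
    pc: loadings.index[loadings[pc].abs() >= cutoff].tolist()
    for pc in loadings.columns
}

print(dominant)
```

Features that load strongly on the same component with the same sign tend to move together, while opposite signs indicate a contrast between them along that component.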