First, we import the packages that we will use. Packages are code written by others that we can reuse. In this example, we import the entire "pandas" package and refer to it through the alias "pd". In addition, we import the "variance_inflation_factor" function from the "statsmodels" package.

In [None]:
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

## 1. Importing Data
#### 1.1 Importing data from Excel to dataframe
This code chunk imports the sheet "T18" from the Excel file "t18-22.xls" and stores it in the "type_of_dwelling_df" variable.

In [None]:
type_of_dwelling_df = pd.read_excel('t18-22.xls', 'T18')

## 2. Basic Data Exploration
#### 2.1 First/last 5 records
This code chunk shows the first 5 records. We can also show the last 5 records using "type_of_dwelling_df.tail()".

In [None]:
type_of_dwelling_df.head() 
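
As a small illustration of "head()" and "tail()", the sketch below uses a made-up dataframe (the name "demo_df" and its values are hypothetical and are not the census data). Both methods accept an optional number of rows.

```python
import pandas as pd

# Hypothetical data used only for illustration
demo_df = pd.DataFrame({"planning_area": list("ABCDEFGH"),
                        "households": range(10, 90, 10)})

first_rows = demo_df.head()   # first 5 rows by default
last_rows = demo_df.tail()    # last 5 rows by default
last_two = demo_df.tail(2)    # pass a number to change how many rows are returned

print(first_rows)
print(last_rows)
print(last_two)
```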

#### 2.2 Columns

In [None]:
type_of_dwelling_df.columns

#### 2.3 Basic column statistics

In [None]:
type_of_dwelling_df.describe()

## 3. Using dataframe
### 3.1 Changing column name
Column names should preferably start with a letter and not contain spaces.

In [None]:
type_of_dwelling_df.columns = ['planning_area', 'flats_1_2_room', 'flats_3_room', 'flats_4_room',
       'flats_5_room_and_executive', 'condominiums_and_other_apartments',
       'landed_properties', 'others']
type_of_dwelling_df.head()
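
When there are many columns, typing every new name by hand can be tedious. The following is a minimal sketch of a programmatic alternative, assuming a made-up dataframe ("demo_df" and its column names are hypothetical). It lower-cases the names, replaces spaces and hyphens with underscores, and then renames the one column that still starts with a digit.

```python
import pandas as pd

# Hypothetical column names that break the naming advice above
demo_df = pd.DataFrame({"Planning Area": ["A", "B"],
                        "3-Room Flats": [10, 20],
                        "Landed Properties": [1, 2]})

# Lower-case the names and replace spaces and hyphens with underscores
demo_df.columns = (demo_df.columns
                   .str.lower()
                   .str.replace(" ", "_", regex=False)
                   .str.replace("-", "_", regex=False))

# Rename the single column that still starts with a digit
demo_df = demo_df.rename(columns={"3_room_flats": "flats_3_room"})

print(list(demo_df.columns))
```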

#### 3.2 Getting subset of columns
We can get a column by using a dot followed by the column name. If we had not renamed the columns, this way of accessing a column would not work for names that contain a space or start with a number.

In [None]:
type_of_dwelling_df.planning_area

An alternative to the dot notation is the bracket notation.

In [None]:
type_of_dwelling_df['planning_area']

The bracket notation can also be used to pull a list of columns.

In [None]:
type_of_dwelling_df[['planning_area','others']]
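
The difference between the two notations can be sketched on a small made-up dataframe ("demo_df" is hypothetical). Note that a single name in brackets gives back a Series, while a list of names gives back a DataFrame.

```python
import pandas as pd

# Hypothetical data: one clean column name and one containing a space
demo_df = pd.DataFrame({"planning_area": ["A", "B"],
                        "Landed Properties": [1, 2]})

# Dot access works for names that are valid Python identifiers
areas = demo_df.planning_area

# A name containing a space cannot be written after a dot, so brackets are used
landed = demo_df["Landed Properties"]

# One name returns a Series; a list of names returns a DataFrame
as_series = demo_df["planning_area"]
as_frame = demo_df[["planning_area"]]

print(type(as_series).__name__)
print(type(as_frame).__name__)
```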

#### 3.2 Calculations
We can perform calculations like finding out the number of households in each planning area. 

In [None]:
total = type_of_dwelling_df.flats_1_2_room \
        + type_of_dwelling_df.flats_3_room \
        + type_of_dwelling_df.flats_4_room \
        + type_of_dwelling_df.flats_5_room_and_executive \
        + type_of_dwelling_df.condominiums_and_other_apartments \
        + type_of_dwelling_df.landed_properties \
        + type_of_dwelling_df.others

total

Instead of writing out the row-wise sum ourselves, we can call the "sum" method to do it. The "axis" parameter of the "sum" method indicates whether it performs a row-wise sum (1) or a column-wise sum (0).

In [None]:
numerical_columns = type_of_dwelling_df.columns[1:]
type_of_dwelling_df[numerical_columns].sum(axis=1)
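
To make the effect of "axis" concrete, here is a small sketch on made-up numbers ("demo_df" is hypothetical).

```python
import pandas as pd

# Hypothetical data used only for illustration
demo_df = pd.DataFrame({"flats": [1, 2, 3],
                        "landed": [10, 20, 30]})

row_totals = demo_df.sum(axis=1)     # one value per row: 11, 22, 33
column_totals = demo_df.sum(axis=0)  # one value per column: flats=6, landed=60

print(row_totals)
print(column_totals)
```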

We can also update each column to be a proportion of the total instead of a raw number.

In [None]:
for c in numerical_columns:
    type_of_dwelling_df[c] /= total
type_of_dwelling_df.head()
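
The loop above can also be written without a loop by using the "div" method. This is only an alternative sketch on made-up data ("demo_df", "value_columns" and "proportions" are hypothetical names), not a change to the notebook's approach.

```python
import pandas as pd

# Hypothetical counts for two planning areas
demo_df = pd.DataFrame({"planning_area": ["A", "B"],
                        "flats": [30, 10],
                        "landed": [10, 30]})
value_columns = ["flats", "landed"]

row_totals = demo_df[value_columns].sum(axis=1)

# axis=0 aligns row_totals with the row index,
# so every row is divided by its own total
proportions = demo_df[value_columns].div(row_totals, axis=0)

print(proportions)
```

Each row of "proportions" now adds up to 1.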

#### 3.3 Creating new data frame

In [None]:
vif_data = pd.DataFrame()
vif_data

#### 3.4 Creating a new column

In [None]:
vif_data["feature"] = numerical_columns
vif_data

In this example, we are using the "variance_inflation_factor" function to calculate the [VIF](https://en.wikipedia.org/wiki/Variance_inflation_factor) (variance inflation factor), which is a measure of the multicollinearity introduced by a column. Multicollinearity occurs when one or more columns provide redundant information. In this example, we obtained the proportion of each type of dwelling. Since they are proportions, they must add up to 1. Therefore, if one column is removed, no information is lost. As an example, if a + b + c = 1 and a and b are known, we can always calculate c.

Another example of multicollinearity is when two or more columns are highly correlated. Examples of perfect correlation are:
1. Having one column with height in cm and another column with height in inches.
2. Having one column with quantity and another column with cost, when the unit cost is the same for every row.

A rule of thumb is that if a feature has VIF>10 then multicollinearity is high. A cutoff of 5 is also commonly used.

In [None]:
feature_values = type_of_dwelling_df[numerical_columns].values
vif_data["VIF"] = [variance_inflation_factor(feature_values, i) for i in range(len(numerical_columns))]
vif_data
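
To show where a VIF value comes from, the sketch below computes it by hand with NumPy on synthetic data, using the textbook definition VIF = 1 / (1 - R²), where R² comes from regressing one column on all the other columns. The function name "vif_by_hand" and the data are hypothetical. This sketch adds an intercept to each regression, so its numbers should not be expected to match the output of the code chunk above.

```python
import numpy as np

def vif_by_hand(X, i):
    """VIF of column i: regress it on the other columns plus an intercept."""
    target = X[:, i]
    others = np.delete(X, i, axis=1)
    design = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    r_squared = 1 - residuals.var() / target.var()
    return 1 / (1 - r_squared)

# Synthetic data: c is almost fully determined by a and b
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + b + rng.normal(scale=0.05, size=200)

redundant = np.column_stack([a, b, c])
independent = np.column_stack([a, b])

print([vif_by_hand(redundant, i) for i in range(3)])    # all far above 10
print([vif_by_hand(independent, i) for i in range(2)])  # both close to 1
```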

#### 3.5 Removing column(s)

In [None]:
type_of_dwelling_df.drop('flats_4_room', inplace=True, axis=1)

In the code chunk below, we import the "T22" sheet from "t18-22.xls", convert the numbers into proportions of each area's total households, keep only the "Planning Area" and "$10,000 & Over" columns, and rename the two columns appropriately.

In [None]:
household_income_df = pd.read_excel('t18-22.xls', 'T22')
total = household_income_df[household_income_df.columns[1:]].sum(axis=1)
for c in household_income_df.columns:
    if c != 'Planning Area':
        household_income_df[c] = household_income_df[c] / total
        
household_income_df = household_income_df[['Planning Area','$10,000 & Over']]
household_income_df.columns = ['planning_area','over_10000']



#### 3.6 Joining two dataframes

In [None]:
df = type_of_dwelling_df.merge(household_income_df, on='planning_area')
df.head()
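
By default, "merge" performs an inner join, so rows whose key appears in only one of the two dataframes are dropped. The sketch below illustrates this on made-up data (the "dwelling" and "income" dataframes are hypothetical) and shows how the "how" parameter changes the behaviour.

```python
import pandas as pd

# Hypothetical data: area "A" has no income row, area "D" has no dwelling row
dwelling = pd.DataFrame({"planning_area": ["A", "B", "C"],
                         "landed": [0.1, 0.2, 0.3]})
income = pd.DataFrame({"planning_area": ["B", "C", "D"],
                       "over_10000": [0.05, 0.15, 0.25]})

# Default how="inner": only areas present in both dataframes are kept
inner = dwelling.merge(income, on="planning_area")

# how="left": keep every dwelling row, filling missing income with NaN
left = dwelling.merge(income, on="planning_area", how="left")

print(inner)
print(left)
```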

## 4. Basic Data analytics
#### 4.1 Scikit-learn (aka sklearn)
The [scikit-learn](https://scikit-learn.org/stable/) package contains modules that are commonly used in machine learning and data analytics. In this example, we are using the "linear_model" and "metrics" modules.

In [None]:
from sklearn import linear_model
from sklearn import metrics

#### 4.2 Preparing data
X and y are the variable names commonly used to denote the features (X) and the target we are learning to predict (y).

In [None]:
feature_columns = df.columns[1:-1]
target_column = df.columns[-1]
X = df[feature_columns]
y = df[target_column]

print(feature_columns)
print(target_column)

#### 4.3 Fitting a Linear Regression model 

In [None]:
reg = linear_model.LinearRegression()
reg = reg.fit(X, y)

#### 4.4 Predicting using fitted model

In [None]:
y_pred = reg.predict(X)
df["prediction"] = y_pred

df

#### 4.5 Evaluating the model
The most basic evaluation metric is the mean absolute error.

In [None]:
metrics.mean_absolute_error(y, y_pred)

Other evaluation metrics include the RMSE (root mean squared error), which is the square root of the mean squared error.

In [None]:
metrics.mean_squared_error(y, y_pred) ** 0.5
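
To see what these error metrics compute, here is a sketch that works them out by hand with NumPy on made-up numbers ("y_true" and "y_hat" are hypothetical and unrelated to the model above).

```python
import numpy as np

# Hypothetical actual values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_hat            # 1, 0, -1, -2

mae = np.mean(np.abs(errors))      # (1 + 0 + 1 + 2) / 4 = 1.0
mse = np.mean(errors ** 2)         # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = np.sqrt(mse)                # square root of the MSE

print(mae, mse, rmse)
```

Because the errors are squared before averaging, the RMSE penalises large errors more heavily than the mean absolute error does.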

Another metric is the R-squared value, which measures how much of the variation in the data is explained by the model. An R-squared value of 0 means that the model is only as good as always predicting the average, and that none of the variation above or below the average is explained by the model. A value of 1 means that the model explains all of the variation.

In [None]:
reg.score(X, y)
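
The R-squared value can also be worked out by hand as 1 minus the ratio of the squared prediction errors to the squared deviations from the average. The sketch below uses made-up numbers and a hypothetical helper named "r_squared".

```python
import numpy as np

def r_squared(y_true, y_hat):
    """1 - (sum of squared errors) / (sum of squared deviations from the mean)."""
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# Hypothetical actual values
y_true = np.array([1.0, 2.0, 3.0, 4.0])

r2_mean = r_squared(y_true, np.full_like(y_true, y_true.mean()))  # always predict the average
r2_perfect = r_squared(y_true, y_true.copy())                     # predict every value exactly
r2_bad = r_squared(y_true, y_true[::-1])                          # predictions in reverse order

print(r2_mean, r2_perfect, r2_bad)
```

The last case shows that the R-squared value can be negative when a model does worse than always predicting the average.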

#### 4.6 Interpreting the results

In [None]:
reg.intercept_

In [None]:
# Label each coefficient with the feature it belongs to
pd.Series(reg.coef_, index=feature_columns, name='coefficient')

The formula generated by the model is as follows:  
Prediction = 0.1717522032461692  
&emsp;&emsp;&emsp;&emsp;&emsp;- 3.499031 * others  
&emsp;&emsp;&emsp;&emsp;&emsp;+ 0.623853 * landed_properties  
&emsp;&emsp;&emsp;&emsp;&emsp;+ 0.407940 * condominiums_and_other_apartments  
&emsp;&emsp;&emsp;&emsp;&emsp;+ 0.134823 * flats_5_room_and_executive  
&emsp;&emsp;&emsp;&emsp;&emsp;- 0.066268 * flats_1_2_room  
&emsp;&emsp;&emsp;&emsp;&emsp;- 0.044635 * flats_3_room  

From this, we can say that the model suggests the following. The coefficients can be compared directly because all features are proportions on the same scale.
- "others" has the highest influence on the prediction. The greater the value of "others", the smaller the prediction.
- "landed_properties" has the 2nd highest influence. The bigger the value of "landed_properties", the bigger the prediction.
- "condominiums_and_other_apartments" has the 3rd highest influence. The bigger the value of "condominiums_and_other_apartments", the bigger the prediction.
- For "flats_1_2_room" and "flats_3_room", the larger the value, the smaller the prediction.
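
As a worked example, the formula can be applied by hand to a single planning area. The intercept and coefficients below are copied from the formula above. The input proportions in "example_area" and the helper "predict_by_hand" are hypothetical and only serve to illustrate the arithmetic.

```python
# Intercept and coefficients copied from the formula above
intercept = 0.1717522032461692
coefficients = {
    "others": -3.499031,
    "landed_properties": 0.623853,
    "condominiums_and_other_apartments": 0.407940,
    "flats_5_room_and_executive": 0.134823,
    "flats_1_2_room": -0.066268,
    "flats_3_room": -0.044635,
}

def predict_by_hand(proportions):
    """Intercept plus the sum of coefficient * value for each feature given."""
    return intercept + sum(coefficients[name] * value
                           for name, value in proportions.items())

# Hypothetical area: half landed properties, half condominiums
example_area = {
    "others": 0.0,
    "landed_properties": 0.5,
    "condominiums_and_other_apartments": 0.5,
    "flats_5_room_and_executive": 0.0,
    "flats_1_2_room": 0.0,
    "flats_3_room": 0.0,
}

prediction = predict_by_hand(example_area)
print(prediction)
```

For this hypothetical area the model would predict that roughly 69% of households fall in the "over_10000" group.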