## 1.Read libraries and the dataset
Load the libraries and the dataset.

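## 2.Data Cleaning and Visualizations
The number of rows does not match the number of unique ids, so we first check for missing values and duplicates. We then look at summary statistics and plots centered on the target variable 'price', drop the variables that do not appear directly related to 'price' in their raw form, and build an initial model. After that, we apply feature engineering and handle multicollinearity.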

* 2-1.Exploring nulls and duplications in the dataset.
* 2-2.Visualizing the price
* 2-3.Model building(1st)
* 2-4-1. Feature engineering: "date"
* 2-4-2. Feature engineering: "renovation"
* 2-4-3. Feature engineering: "zipcode"
* 2-4-4. New dataset
* 2-4-5. Detecting multicollinearity

## 3.Model building and Evaluation
Build the prediction model and compare it with the 1st model built on the raw data.

## 1.Read libraries and the dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

## 1-1. Load the dataset

In [2]:
df = pd.read_csv("../input/kc_house_data.csv")
df.head()

In [3]:
df.tail()

In [4]:
print(df.shape)
print('------------------------')
print(df.nunique())
print('------------------------')
print(df.dtypes)

The dataset has 21,613 rows but only 21,436 unique ids, so there are 177 fewer unique ids than rows. Either missing data or duplicated data must be present.

## 2.Data Cleaning and Visualizations

### 2-1.Exploring nulls and duplications in the dataset.

In [5]:
df.isnull().sum()

In [6]:
df['id'].value_counts()

In [7]:
(df['id'].value_counts() >= 2).sum()  # number of ids that appear more than once

The id column has no missing values, and 176 ids appear more than once. The other variables have no missing values either.
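To see which rows are involved, the repeated ids can be listed directly. The sketch below uses a small made-up frame (`toy` is an assumption for illustration, not data from the file) and the pandas method `DataFrame.duplicated`. It also shows that the number of repeated ids and the number of surplus rows differ as soon as one id appears three or more times.

```python
import pandas as pd

# Toy data for illustration only: id 1 appears twice, id 3 three times.
toy = pd.DataFrame({
    "id":    [1, 1, 2, 3, 3, 3],
    "price": [100, 110, 200, 300, 310, 320],
})

# Ids that occur more than once (what the value_counts check counts).
n_repeated_ids = int((toy["id"].value_counts() >= 2).sum())

# Rows beyond the first occurrence of each id (rows minus unique ids).
n_surplus_rows = int(toy.duplicated(subset="id", keep="first").sum())

# All rows that share an id with another row, for manual inspection.
repeated_rows = toy[toy.duplicated(subset="id", keep=False)].sort_values("id")

print(n_repeated_ids, n_surplus_rows)
print(repeated_rows)
```

A repeated id may simply be the same house sold more than once, so whether to drop such rows is a judgment call rather than a pure cleaning step.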

### 2-2. Visualizing the price

First, check the histogram and the summary statistics of "price".

In [9]:
plt.hist(df['price'],bins=100)

In [10]:
# Seeing the fundamental statistics of price.
df.describe()['price']

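The distribution is skewed to the right. The maximum is more than 100 times the minimum.
Next, plot the relationship between 'price' and the other explanatory variables, excluding 'date'.
(Note: 'date' is excluded because it clearly needs to be transformed first.)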
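The right skew can also be put into numbers rather than read off the histogram. The following is a minimal sketch on made-up values (the `prices` series is an assumption, not the dataset), using `Series.skew` and the max/min ratio.

```python
import pandas as pd

# Toy right-skewed prices (illustrative values, not from the dataset).
prices = pd.Series([75_000, 200_000, 320_000, 450_000, 540_000, 650_000, 7_700_000])

skewness = prices.skew()              # > 0 indicates a longer right tail
ratio = prices.max() / prices.min()   # how many times larger the maximum is
mean_above_median = prices.mean() > prices.median()

print(f"skew={skewness:.2f}, max/min={ratio:.1f}, mean>median={mean_above_median}")
```

On the real data the same three lines would be applied to `df['price']`.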

In [11]:
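# 'date' is a string column, so restrict the correlation matrix to numeric columns
df.select_dtypes('number').corr().style.background_gradient().format('{:.2f}')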

In [12]:
for i in df.columns:
    if i not in ('price', 'date'):
        df[[i,'price']].plot(kind='scatter',x=i,y='price')

'yr_renovated' and 'zipcode' are int64, but judging from the plots they are just numeric codes, so they need feature engineering.

### 2-3. Model Building (1st)

* The initial model uses 16 explanatory variables, excluding the 4 variables 'id', 'date', 'yr_renovated' and 'zipcode', which do not appear related to 'price' in their raw form.

In [13]:
from sklearn.linear_model import LinearRegression
X_1 = df.drop(['price','id','date','yr_renovated','zipcode'],axis=1)
y_1 = df['price']

X_train_1,X_test_1,y_train_1,y_test_1 = train_test_split(X_1,y_1,random_state=42)

regr_train_1=LinearRegression(fit_intercept=True).fit(X_train_1,y_train_1)
y_pred_1 = regr_train_1.predict(X_test_1)

In [14]:
# Evaluate the 1st model on the held-out test set
MAE_1 = mean_absolute_error(y_test_1,y_pred_1)
MSE_1 = mean_squared_error(y_test_1,y_pred_1)

print('MAE_1:',MAE_1,'/','MSE_1:',MSE_1)

### 2-4-1. Feature engineering: "date"

As noted above, first convert 'date' into day of week and month. Day of week (weekends) and month (moving season) may be related to sales (does 'price' go up or down?).

In [15]:
df.date.head()

In [16]:
pd.to_datetime(df.date).map(lambda x:'dow'+str(x.weekday())).head()

** dow: day of week, 0=Monday, 6=Sunday

In [17]:
pd.to_datetime(df.date).map(lambda x:'month'+str(x.month)).head()

** month1=January, month12=December
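The numbering convention is easy to verify on a known week. This is a small sketch, assuming the vectorized `.dt` accessors as an alternative to the `map(lambda ...)` calls above; the sample dates are chosen for illustration.

```python
import pandas as pd

# One sample date per weekday; 2014-10-13 was a Monday.
dates = pd.Series(pd.date_range("2014-10-13", periods=7, freq="D"))

dow = dates.dt.dayofweek      # same convention as Timestamp.weekday()
month = dates.dt.month

dow_labels = "dow" + dow.astype(str)
month_labels = "month" + month.astype(str)

print(list(dow_labels))
print(month_labels.unique())
```

The labels run from 'dow0' (Monday) to 'dow6' (Sunday), matching what `weekday()` returns.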

In [18]:
df['dow'] = pd.to_datetime(df.date).map(lambda x:'dow'+str(x.weekday()))
df['month'] = pd.to_datetime(df.date).map(lambda x:'month'+str(x.month))

Next, convert these into 0/1 dummy variables.

In [19]:
pd.get_dummies(df['dow']).head()

In [20]:
pd.get_dummies(df['month']).head()

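The month columns are not in calendar order (the labels sort as strings, so 'month10' comes before 'month2'); this is left as is here.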
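One possible way to get calendar order is to zero-pad the month number before building the label. The sketch below is an assumption about how this could be done, not what the notebook does; note that it would rename the columns (for example 'month1' becomes 'month01'), so later cells that refer to 'month1' would need the same change.

```python
import pandas as pd

months = pd.Series([1, 2, 10, 12, 2])  # toy month numbers

# Plain labels sort as strings: month1, month10, month12, month2
plain = pd.get_dummies("month" + months.astype(str))

# Zero-padded labels sort in calendar order: month01, month02, month10, month12
padded = pd.get_dummies("month" + months.astype(str).str.zfill(2))

print(list(plain.columns))
print(list(padded.columns))
```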

### 2-4-2. Feature engineering: "renovation"

'yr_renovated' is hard to use as is, so convert it into a binary variable indicating whether the house was renovated or not.

In [21]:
df.yr_renovated.head()

In [22]:
df['yr_renovated'].value_counts().sort_index().head()

In [23]:
np.array(df['yr_renovated'] !=0)

In [24]:
np.array(df['yr_renovated'] !=0)*1

In [25]:
df['yr_renovated_bin'] = np.array(df['yr_renovated'] != 0)*1
df['yr_renovated_bin'].value_counts()
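The same flag can be written without going through a NumPy array. A minimal sketch on toy values, assuming (as the cells above do) that 0 means the house was never renovated:

```python
import pandas as pd

# Toy values; 0 is assumed to mean "never renovated".
yr_renovated = pd.Series([0, 1991, 0, 2005, 0])

# Element-wise "not equal to 0", then True/False -> 1/0.
renovated_flag = yr_renovated.ne(0).astype(int)

print(renovated_flag.tolist())
```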

### 2-4-3. Feature engineering: "zipcode"

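'zipcode' also cannot be used as is because it is just a number. However, the area is a very important variable for 'price', so convert it into 0/1 dummy variables and use them as explanatory variables.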

In [None]:
df['zipcode'].astype(str).head()

In [None]:
df['zipcode_str'] = df['zipcode'].astype(str).map(lambda x:'zip_'+x)
pd.get_dummies(df['zipcode_str']).head()
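As a possible shortcut, `pd.get_dummies` accepts a `prefix` argument, which gives the same 'zip_' column names without creating a helper string column. A sketch on toy zip codes (illustrative values only):

```python
import pandas as pd

zipcode = pd.Series([98001, 98103, 98001, 98178])  # toy values

# One 0/1 column per distinct zip code, named zip_<code>.
zip_dummies = pd.get_dummies(zipcode, prefix="zip")

print(list(zip_dummies.columns))
print(zip_dummies.sum(axis=1).tolist())  # exactly one column is set per row
```

Because every row has exactly one column set, the full set of dummy columns always sums to 1, which matters for the multicollinearity check later on.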

### 2-4-4. New dataset

Add the 0/1-encoded 'dow', 'month' and 'zipcode' as new explanatory variables.

In [None]:
df['zipcode_str'] = df['zipcode'].astype(str).map(lambda x:'zip_'+x)
df_en = pd.concat([df,pd.get_dummies(df['zipcode_str'])],axis=1)
df_en = pd.concat([df_en,pd.get_dummies(df.dow)],axis=1)
df_en = pd.concat([df_en,pd.get_dummies(df.month)],axis=1)

The dummy columns were added successfully, so drop the original variables.

In [None]:
df_en_fin = df_en.drop(['date','zipcode','yr_renovated','month','dow','zipcode_str',],axis=1)

In [None]:
print(df_en_fin.shape)
print('------------------------')
print(df_en_fin.nunique())

In [None]:
df_en_fin.head()

### 2-4-5. Detecting multicollinearity

> Check whether multicollinearity is occurring.

In [None]:
X = df_en_fin.drop(['price'],axis=1)
y = df_en_fin['price']
regr = LinearRegression(fit_intercept=True).fit(X,y)
model_2 = regr.score(X,y)
for i, coef in enumerate(regr.coef_):
    print(X.columns[i],':',coef)

* Looking at regr.coef_, the coefficient of 'bedrooms', for example, is negative against 'price'. Normally 'bedrooms' would be expected to be positively related to 'price'. The negative sign is probably caused by its strong positive correlation (0.58) with 'sqft_living', which suggests multicollinearity; other variables may be affected as well.
* When multicollinearity is suspected, the VIF value should be checked.

In [None]:
df_vif = df_en_fin.drop(["price"],axis=1)
for cname in df_vif.columns:  
    y=df_vif[cname]
    X=df_vif.drop(cname, axis=1)
    regr = LinearRegression(fit_intercept=True)
    regr.fit(X, y)
    rsquared = regr.score(X,y)
    # VIF = 1/(1 - R^2); score() already returns R^2, so it must not be squared again
    print(cname,":" ,np.divide(1.0, 1.0-rsquared))

Multicollinearity is generally suspected when the VIF (Variance Inflation Factor) exceeds 10, or when it is inf (rsquared == 1). Therefore, we list the variables that meet the criterion 'rsquared > 1.-1e-10' and, as an empirical rule, show for each of them the variables whose 'regr.coef_' is larger than 0.5 in absolute value.
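For reference, the VIF of a variable is 1 / (1 - R²), where R² comes from regressing that variable on all the others. The following is a minimal sketch of that definition wrapped in a function; the name `vif_table` and the toy data are assumptions for illustration, and only `LinearRegression.fit` and `.score` from scikit-learn are used.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_table(features: pd.DataFrame) -> pd.Series:
    """Return VIF_j = 1 / (1 - R_j^2) for every column of `features`."""
    vifs = {}
    for name in features.columns:
        target = features[name]
        others = features.drop(columns=name)
        r2 = LinearRegression().fit(others, target).score(others, target)
        # np.divide yields inf (with a warning) instead of raising when r2 == 1.
        vifs[name] = float(np.divide(1.0, 1.0 - r2))
    return pd.Series(vifs)

# Toy data: c is almost a linear combination of a and b; d is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
d = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": b, "c": a + b + 0.01 * rng.normal(size=200), "d": d})

print(vif_table(toy))
```

In this toy example 'a', 'b' and 'c' get very large VIF values while 'd' stays close to 1, which is the pattern the threshold of 10 is meant to catch.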

In [None]:
df_vif = df_en_fin.drop(["price"],axis=1)
for cname in df_vif.columns:  
    y=df_vif[cname]
    X=df_vif.drop(cname, axis=1)
    regr = LinearRegression(fit_intercept=True)
    regr.fit(X, y)
    rsquared = regr.score(X,y)
    #print(cname,":" ,1/(1-rsquared))
    if rsquared > 1. -1e-10:
        print(cname,X.columns[(regr.coef_> 0.5) | (regr.coef_ < -0.5)])

Drop 'sqft_above', 'zip_98001', 'month1' and 'dow1', one column from each group flagged above.

In [None]:
df_en_fin = df_en_fin.drop(['sqft_above','zip_98001','month1','dow1'],axis=1)

df_vif = df_en_fin.drop(["price"],axis=1)
for cname in df_vif.columns:  
    y=df_vif[cname]
    X=df_vif.drop(cname, axis=1)
    regr = LinearRegression(fit_intercept=True)
    regr.fit(X, y)
    rsquared = regr.score(X,y)
    #print(cname,":" ,1/(1-rsquared))
    if rsquared > 1. -1e-10:
        print(cname,X.columns[(regr.coef_> 0.5) | (regr.coef_ < -0.5)])

No perfect multicollinearity remains!
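The dummy columns had to be dropped by hand because a complete set of 0/1 columns always sums to 1 and is therefore perfectly collinear with the intercept. As an alternative sketch (not what this notebook does), `pd.get_dummies` has a `drop_first` option that removes one level when the dummies are created. It drops the first level in sorted order, so on the real data it would remove 'dow0' rather than 'dow1'.

```python
import pandas as pd

dow = pd.Series(["dow0", "dow1", "dow2", "dow0"])  # toy values

full = pd.get_dummies(dow)                       # 3 columns, each row sums to 1
reduced = pd.get_dummies(dow, drop_first=True)   # drops the first level ('dow0')

print(list(full.columns), list(reduced.columns))
```

Which level is dropped only changes the reference category of the coefficients, not the predictions of a linear model with an intercept.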

## 3.Model building and Evaluation

The model is built on the training dataset after the multicollinearity check. It is evaluated with MSE (mean_squared_error) and MAE (mean_absolute_error) so that it can be compared with the 1st model.

In [None]:
X_multi = df_en_fin.drop(['price'],axis=1)
y_target = df_en_fin['price']

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X_multi,y_target,random_state=42)

In [None]:
regr_train=LinearRegression(fit_intercept=True).fit(X_train,y_train)
y_pred = regr_train.predict(X_test)

In [None]:
MAE_2 = mean_absolute_error(y_test,y_pred)
MSE_2 = mean_squared_error(y_test,y_pred)

print('MAE_2:',MAE_2,"/","MSE_2:",MSE_2)

In [None]:
print('MAE:{:.4f}'.format(1-MAE_2 / MAE_1),'/','MSE:{:.4f}'.format(1-MSE_2 / MSE_1))
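The two ratios printed above are not on the same scale: MAE is in price units, while MSE is in squared price units, so its relative reduction looks larger. A small sketch of the arithmetic, using hypothetical error values (the numbers and the helper name `relative_reduction` are assumptions for illustration):

```python
import numpy as np

def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` is smaller than `baseline` (0.25 means 25% lower)."""
    return 1.0 - new / baseline

# Hypothetical error values, only to show the arithmetic.
mse_base, mse_new = 4.0e10, 3.0e10

mse_cut = relative_reduction(mse_base, mse_new)
rmse_cut = relative_reduction(np.sqrt(mse_base), np.sqrt(mse_new))

print(f"MSE reduction: {mse_cut:.4f}, RMSE reduction: {rmse_cut:.4f}")
```

Taking the square root (RMSE) puts the squared-error metric back into price units, which makes its reduction easier to compare with the MAE reduction.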

## Conclusion 
The prediction model built after feature engineering and removing the multicollinear variables was clearly better than the 1st model: MAE decreased by 23.5% and MSE by 36.3%.


## Next Issues
1. MAE and MSE did decrease by more than 20%, but only LinearRegression was tried this time. Other methods should be tried to improve the results further.

2. With more than 100 explanatory variables, overfitting may occur, so the number of variables may need to be reduced.