This report is written by Ye Yuan V00886654, Yiliang Liu V00869672, Weiyi Zhang V00868237

## Overview
> 1. Download the Communities and Crime data from https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime. Use the first 1495 rows of data as the training set and the rest as the test set.  
2. The data set has missing values. Use a data imputation technique to deal with the missing values in the data set. The data description mentions some features are nonpredictive. Ignore those features.  
3. Plot a correlation matrix for the features in the data set.   
4.  Calculate the Coefficient of Variation CV for each feature, where CV = s/m, in which s is the sample standard deviation and m is the sample mean.  
5.  Pick ⌊√128⌋ features with the highest CV, and make scatter plots and box plots for them.
6.  Fit a linear model using least squares to the training set and report the test error.    
7.  Fit a ridge regression model on the training set, with $\lambda$ chosen by cross-validation. Report the test error obtained.  
8.  Fit a LASSO model on the training set, with $\lambda$ chosen by cross-validation. Report the test error obtained, along with a list of the variables selected by the model. Repeat with standardized features. Report the test error for both cases and compare them.
9.  Fit a PCR model on the training set, with M (the number of principal components) chosen by cross-validation. Report the test error obtained.  
10. In this section, we would like to fit a boosting tree to the data. As in classification trees, one can use any type of regression at each node to build a multivariate regression tree. Because the number of variables is large in this problem, one can use L1-penalized regression at each node. Such a tree is called L1 penalized gradient boosting tree.

*Note: Rerunning the code may raise an error because a model is missing. (We encountered this error ourselves, so we mention it here.)

## Data Cleaning and Data Preparation

In [1]:
# import libraries
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
from sklearn.metrics import make_scorer, r2_score, mean_squared_error, auc, mean_absolute_error
from sklearn.model_selection import GridSearchCV, KFold
from matplotlib import style
style.use("ggplot")
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

In [2]:
# read data
df = pd.read_csv('communities.txt',header=None,names=["col_" + str(i) for i in range(127)] + ['goal'])
# set train size 
train_size = 1495
df


In [None]:
# print df base info
df.describe()

### Pre-process

In [None]:
# Convert the '?' placeholders used by the data set to NaN
df = df.replace('?', np.nan)
df
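
As a possible alternative to converting `?` after loading, `pandas.read_csv` can mark it as missing at read time through its `na_values` parameter. The sketch below uses a small in-memory sample with made-up values (not the real file) to show the effect. Note that columns containing `?` are then parsed as floats rather than objects, so the later split into object and float columns would behave differently.

```python
import io
import pandas as pd

# Tiny stand-in for communities.txt (made-up values, for illustration only)
sample = io.StringIO("0.19,?,0.33\n0.00,0.12,?\n0.45,0.21,0.02\n")

# na_values="?" tells the parser to read "?" as NaN
demo = pd.read_csv(sample, header=None, names=["a", "b", "c"], na_values="?")

missing_per_column = demo.isnull().sum()
print(missing_per_column)
print(demo.dtypes)
```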

In [None]:
# fraction and count of missing values per column, sorted from most to least missing
df_null = df.isnull().sum() / df.shape[0]
df_null = df_null.reset_index().rename(columns = {"index":"columns",0:"ratio"})
df_null['missing_count'] = list(df.isnull().sum())
df_null = df_null.sort_values(by=['ratio'],ascending=False)
df_null

### Data missing ratio

In [None]:
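# rebuild the data set without the non-predictive features
# the first five columns (state, county, community, communityname, fold) are marked non-predictive in the data description
df1 = df.iloc[:,5:].copy()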
df1

### Split data and clean NaN

In [None]:
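# columns that contained '?' were read as strings (object dtype); convert them to numbers first
df_obj = df1.select_dtypes("object").apply(pd.to_numeric, errors="coerce")
# fill missing values with the column median
df_obj = df_obj.fillna(df_obj.median())
df_obj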

In [None]:
# select float features and use mean to fill missing data
df_float = df1.select_dtypes("float64").copy()
df_float = df_float.fillna(df_float.mean())

In [None]:
# concat obj features and float features
df1 = pd.concat([df_obj,df_float],axis=1)
df1
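
The cells above compute the fill values from all rows, including those that later become the test set. A minimal sketch of an alternative, assuming scikit-learn's `SimpleImputer` and a toy matrix with made-up numbers, learns the medians from the training rows only and reuses them for the test rows:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with missing entries (made-up numbers, not the crime data)
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [100.0, 40.0]])
n_train = 3  # the first rows play the role of the training set

imputer = SimpleImputer(strategy="median")
X_train = imputer.fit_transform(X[:n_train])  # medians learned from the training rows only
X_test = imputer.transform(X[n_train:])       # the same medians are reused for the test rows

print(imputer.statistics_)
```

Here the outlying value 100.0 sits in the test row, so it does not influence the median used to fill the training set.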

In [None]:
# compute the correlation matrix of all features
corr = df1.corr()
# plot the correlation matrix as a heatmap
plt.figure(figsize=(16,12))
sns.heatmap(corr)

In [None]:
# coefficient of variation: CV = s / m (sample standard deviation over sample mean)
coe_var = df1.std() / df1.mean()
coe_var = coe_var.reset_index().rename(columns = {"index":"columns",0:"cv"}).sort_values("cv",ascending = False)
coe_var
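
As a quick check of the formula, the sketch below computes the CV for a toy frame (made-up values). `DataFrame.std()` uses `ddof=1` by default, so `s` is the sample standard deviation.

```python
import pandas as pd

# Toy data: "x" varies, "y" is constant
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 10.0, 10.0]})

# DataFrame.std() defaults to ddof=1, i.e. the sample standard deviation
cv = toy.std() / toy.mean()
print(cv)
```

For `x` the sample standard deviation is 1 and the mean is 2, giving CV = 0.5; the constant column `y` has CV = 0.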

In [None]:
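# names of the floor(sqrt(p)) features with the highest CV, where p = df1.shape[1] - 1 excludes the target column from the count
high_cv = coe_var.iloc[:int(np.sqrt(df1.shape[1] - 1)),0].values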
high_cv

## Exploratory Data Analysis

In [None]:
temp = df1[high_cv]
# box plots of the high-CV features
plt.figure(figsize=(12,8))
temp.boxplot()

In [None]:
# pairwise scatter plots of the high-CV features
sns.pairplot(temp)

## Data Mining

In [None]:
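# split into training and test sets (copies, so pop() does not act on a view of df1)
train = df1.iloc[:train_size,:].copy()
test = df1.iloc[train_size:,:].copy()
# pop() removes the target column in place and returns it
train_y = train.pop('goal')
test_y = test.pop("goal")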
train_X = train
test_X = test

### 1. Linear Regression

In [None]:
# import libraries
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error,r2_score

# Fit a linear model using least squares to the training set and report the test error.
lr = LinearRegression()
lr.fit(train_X,train_y)
y_pre = lr.predict(test_X)
# handle abnormal predictions: test predictions outside the range of the training predictions are set to 0
y_train = lr.predict(train_X)
lo, hi = y_train.min(), y_train.max()
y_pre = np.where((y_pre < lo) | (y_pre > hi), 0, y_pre)
# print train and test error
print("train mse is ",mean_squared_error(train_y,y_train))
print("test mse is ",mean_squared_error(test_y,y_pre))

In [None]:
# reduce the features with PCA, then refit the linear model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.decomposition import PCA

temp = pd.concat([train_X,test_X])
size = train_X.shape[0]

# keep enough principal components to explain 95% of the variance
# note: PCA is fit here on the training and test features together
pca = PCA(n_components=0.95)
pca.fit(temp)
tt = pca.transform(temp)

# from here on train_X and test_X hold the principal-component scores
train_X = tt[:size,:]
test_X = tt[size:,:]

# Fit a linear model using least squares to the training set and report the test error.
lr = LinearRegression()
lr.fit(train_X,train_y)
y_pre = lr.predict(test_X)
y_train = lr.predict(train_X)
# print train and test error
print("train mse is ",mean_squared_error(train_y,y_train))
print("test mse is ",mean_squared_error(test_y,y_pre))
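
The cell above fixes the number of components through a 95% variance threshold. A minimal sketch of choosing M by cross-validation instead, as the task statement asks, could wrap PCA and least squares in a scikit-learn `Pipeline` and search over `n_components`. The data below are synthetic stand-ins, and the names `pcr`, `search`, and `best_m` are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (not the crime data set)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = X[:150], X[150:], y[:150], y[150:]

# PCA and the regression are refit inside every fold,
# so the components are never estimated from held-out rows
pcr = Pipeline([("pca", PCA()), ("ols", LinearRegression())])
search = GridSearchCV(
    pcr,
    param_grid={"pca__n_components": list(range(1, 11))},
    cv=10,
    scoring="neg_mean_squared_error",
)
search.fit(X_tr, y_tr)

best_m = search.best_params_["pca__n_components"]
test_mse = float(np.mean((search.predict(X_te) - y_te) ** 2))
print("M =", best_m, "test mse =", test_mse)
```

Because the pipeline is fit on the training rows only, the test rows play no part in estimating the components.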

### 2. Ridge Regularization

In [None]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score


# search for an optimal value of alpha for the ridge model
alphas = np.linspace(0.1,10,50)
for k in alphas:
    rd = Ridge(alpha=k)
    # 10-fold cross-validation on the training set
    scores = cross_val_score(rd, train_X, train_y, cv=10, scoring='neg_mean_squared_error')
    # print the mean cross-validated MSE
    print("alpha is ",k,"cv mse is ",-scores.mean())
# Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
ridge = Ridge(alpha=2.726)
ridge.fit(train_X,train_y)
y_pre = ridge.predict(test_X)
# print test error
print("test mse is ",mean_squared_error(test_y,y_pre))
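
Rather than reading the best alpha off the printed list and typing it in, the selection can be done in code. The sketch below assumes the same grid and 10-fold scheme as above, but runs on synthetic stand-in data; `cv_mse` and `best_alpha` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the crime data set)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=120)

alphas = np.linspace(0.1, 10, 50)  # same grid as in the cell above
cv_mse = np.array([
    -cross_val_score(Ridge(alpha=a), X, y, cv=10,
                     scoring="neg_mean_squared_error").mean()
    for a in alphas
])

# pick the alpha with the lowest mean cross-validated MSE
best_alpha = alphas[int(np.argmin(cv_mse))]
ridge = Ridge(alpha=best_alpha).fit(X, y)
print("alpha chosen by 10-fold CV:", best_alpha)
```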

### 3. Lasso Regularization

In [None]:
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score


# search for an optimal value of alpha for the Lasso model
alphas = np.linspace(0.1,10,50)
for k in alphas:
    rd = Lasso(alpha=k)
    # 10-fold cross-validation on the training set
    scores = cross_val_score(rd, train_X, train_y, cv=10, scoring='neg_mean_squared_error')
    # print the mean cross-validated MSE
    print("alpha is ",k,"cv mse is ",-scores.mean())
    
# Fit a LASSO model on the training set, with λ chosen by cross-validation. Report the test error obtained, 
# along with a list of the variables selected by the model. 
lass = Lasso(alpha = 0.5)
lass.fit(train_X,train_y)
y_pre = lass.predict(test_X)
print("test mse is ",mean_squared_error(test_y,y_pre))
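
The task also asks for the list of variables selected by the Lasso, which are the features whose coefficients are non-zero. A minimal sketch on synthetic stand-in data, assuming scikit-learn's `LassoCV` to choose alpha (the cell above uses a manual grid instead) and hypothetical feature names `f0` to `f5`:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV

# Synthetic stand-in data: only f0 and f3 carry signal
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 6)), columns=[f"f{i}" for i in range(6)])
y = 3.0 * X["f0"] - 2.0 * X["f3"] + rng.normal(scale=0.1, size=200)

# LassoCV builds its own alpha grid and picks alpha by 10-fold CV
lasso_cv = LassoCV(cv=10, random_state=0).fit(X, y)

# the variables "selected" by the Lasso are those with non-zero coefficients
selected = X.columns[lasso_cv.coef_ != 0].tolist()
print("alpha:", lasso_cv.alpha_)
print("selected:", selected)
```

The same mask, `coef_ != 0`, applies to a fitted `Lasso` object when the features are kept in a DataFrame with named columns.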

### 4. StandardScaler and Lasso

In [None]:
# import libraries
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# standardize the training and test features (the scaler is fit on the training set only)
# Repeat with standardized features. Report the test error for both cases and compare them.
std = StandardScaler()
ss_train = std.fit_transform(train)
ss_test = std.transform(test)

# search for an optimal value of alpha for the Lasso model
alphas = np.linspace(0.1,5,50)
for k in alphas:
    rd = Lasso(alpha=k)
    # 10-fold cross-validation on the standardized training set
    scores = cross_val_score(rd, ss_train, train_y, cv=10, scoring='neg_mean_squared_error')
    print("alpha is ",k,"cv mse is ",-scores.mean())
    
lass = Lasso(alpha=0.1)
lass.fit(ss_train,train_y)
y_pre = lass.predict(ss_test)

print("test mse is ",mean_squared_error(test_y,y_pre))


### Conclusion

We first fit a linear regression to the data and obtained an MSE of 0.016643431264981627 on the training set and 0.0218767918707957 on the test set. We then applied PCA to the data set and evaluated four models: linear regression, ridge regression, Lasso, and Lasso with standardized features (StandardScaler). The MSEs of their predictions are as follows.

![predictions.png](attachment:predictions.png)



As the figure shows, linear regression and ridge regression have the lower MSEs, so among these four algorithms they are the better choices for this data set.