<a name="home"></a>
# Data Analysis with Python
<hr />

## Table of Contents
1. [Importing Datasets](#imp)
2. [Pre-processing Data in Python](#prep)
3. [Exploratory Data Analysis](#eda)
4. [Model Development](#mod)
5. [Measures for In-Sample Evaluation](#insamp)
6. [Model Evaluation](#eva)


<a name="imp"></a>
## Importing Datasets
Data analysis, and data science more broadly, helps us unlock the information and insights in raw data so that we can answer our questions. Data analysis plays an important role by helping us to 
* discover useful information in the data, 
* answer questions, 
* and predict the future or the unknown  

### Understanding Data
#### Python Packages
A Python library is a collection of functions and methods that lets you perform many actions without writing the underlying code yourself. Libraries usually contain built-in modules providing different functionality that you can use directly. 

There are extensive libraries offering a broad range of facilities. We have divided the Python data analysis libraries into three groups. 
1. <b>Scientific computing libraries</b>
     1.1 <b>Pandas</b> 
         * offers data structures and tools for effective data manipulation and analysis. 
         * It provides fast access to structured data.
         * The primary instrument of Pandas is a two-dimensional table with column and row labels, called a data frame. 
         * It is designed to provide easy indexing functionality. 
     1.2 The <b>NumPy</b> library 
         * uses arrays for its inputs and outputs. 
         * can be extended to objects for matrices and 
         * with minor coding changes, developers can perform fast array processing. 
     1.3 <b>SciPy</b> includes 
         * functions for some advanced math problems such as: Integrals, differential equations and optimisations
         * data visualization  
2. <b>Data visualization libraries</b>
    These libraries enable you to create graphs, charts and maps. 
    2.1 <b>Matplotlib package</b> 
        * is the most well known library for data visualization. 
        * It is great for making graphs and plots. 
        * The graphs are customizable. 
    2.2 <b>Seaborn</b>
        * Is based on Matplotlib
        * Easy to generate various plots such as heat maps, time series and violin plots.
3. <b>Algorithmic libraries</b>
    The algorithmic libraries tackle machine learning tasks from basic to complex. 
    3.1 <b>Scikit-learn library</b> contains tools for statistical modeling, including regression, classification, clustering, and so on. This library is built on NumPy, SciPy and Matplotlib.
    3.2 <b>Statsmodels</b> is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests.

#### Importing and Exporting Data in Python
<b>Data acquisition</b> is the process of loading and reading data into a notebook from various sources. To read data with Python's pandas package, there are two important factors to consider: <b>format and file path</b>. 
* <b>Format</b> is the way data is encoded. Some common formats are: CSV, JSON, XLSX, HDF ...
* <b>Path</b> tells us where the data is stored  

In pandas, the <b>read_csv method</b> reads a file with comma-separated columns into a pandas data frame. 
1. import pandas 
2. define a variable with a file path
3. use the read_csv method to import the data

After reading the dataset, it is a good idea to look at the data frame to get a better intuition and to ensure that everything occurred the way you expected. 
* df prints the entire dataset 
* df.head(n) shows the first n rows of the data frame. 
* df.tail(n) shows the last n rows of the data frame.
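
The import steps above can be sketched as follows. This is a minimal illustration: the small in-memory CSV (`csv_text`) and its two columns are made up so the snippet runs without a file or network access. `read_csv` accepts a path, a URL, or a file-like object in the same way.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk or a URL.
csv_text = "make,price\naudi,13950\nbmw,16430\nvolvo,12940\n"

# read_csv parses comma-separated text into a data frame.
df = pd.read_csv(io.StringIO(csv_text))

first_rows = df.head(2)  # first 2 rows
last_rows = df.tail(1)   # last row

print(first_rows)
print(last_rows)
```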

### Start analysing Data in Python
Pandas has several built-in methods that can be used to understand the data type of each feature or to look at the distribution of data within the dataset. Using these methods gives an overview of the dataset and also points out potential issues, such as a feature with the wrong data type, which may need to be resolved later on. 

Data has a variety of types. The main types stored in Pandas' objects are 
* object, 
* float, 
* int, 
* datetime

Statistical metrics can tell the data scientist whether there are mathematical issues, such as extreme outliers and large deviations. 

To get quick statistics, we use the describe method (<b>df.describe()</b>). For each numeric column it returns 
* count (the number of non-missing values), 
* mean
* standard deviation
* minimum
* boundary of each of the quartiles
* maximum 

By default, the dataframe.describe function skips columns that do not contain numbers. It is possible to make the describe method work for object type columns as well. To enable a summary of all the columns, we add the argument include="all" inside the describe function brackets. 

For object type columns, a different set of statistics is evaluated: 
* unique (the number of distinct values), 
* top (the most frequent value), 
* freq (the number of times the top value appears)  
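
A small sketch of the two describe calls, using a made-up four-row data frame (the `make` and `price` values are illustrative only):

```python
import pandas as pd

# Hypothetical data: one text column and one numeric column.
df = pd.DataFrame({
    "make": ["audi", "bmw", "audi", "volvo"],
    "price": [13950.0, 16430.0, 17450.0, 12940.0],
})

numeric_summary = df.describe()            # numeric columns only
full_summary = df.describe(include="all")  # adds unique, top, freq for text columns

print(df.dtypes)
print(full_summary)
```

In the full summary, statistics that do not apply to a column (for example the mean of `make`) appear as NaN.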

### Accessing Databases with Python
Python code connects to a database using API calls. An API (application programming interface) is a set of functions that you can call to get access to some type of service.  
The two main concepts in the Python DB-API are <b>connection and cursor objects</b>. 
* You use connection objects to connect to a database and manage your transactions. 
* Cursor objects are used to run queries. You open a cursor object and then run queries. 
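
As an illustration of connection and cursor objects, here is a minimal sketch using `sqlite3` from the standard library, which follows the DB-API. The in-memory database, the `cars` table, and its rows are assumptions made for the example; other database drivers follow the same pattern with their own connect call.

```python
import sqlite3

# The connection object manages the database session and transactions.
connection = sqlite3.connect(":memory:")

# The cursor object runs the queries.
cursor = connection.cursor()
cursor.execute("CREATE TABLE cars (make TEXT, price REAL)")
cursor.executemany(
    "INSERT INTO cars VALUES (?, ?)",
    [("audi", 13950.0), ("bmw", 16430.0)],
)
connection.commit()  # make the inserts permanent

cursor.execute("SELECT make, price FROM cars ORDER BY price")
rows = cursor.fetchall()
print(rows)

# Release resources when done.
cursor.close()
connection.close()
```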

Exercise
1. import pandas library
    import pandas as pd
2. Read the online file from the URL below and assign it to the variable "df"
    other_path = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/auto.csv"
    df = pd.read_csv(other_path, header=None)
3. create headers list
    headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]  
    print("headers\n", headers)  
    df.columns = headers  
    df.head(10)  
4. drop missing values along the column "price"
    df = df.dropna(subset=["price"], axis=0)  
5. Find the names of the columns  
    print(df.columns)
6. Save a Dataset
    df.to_csv("automobile.csv", index=False)
    
| Data Format   | Read           | Save             |
| ------------- |:--------------:| ----------------:|
| csv           | `pd.read_csv()`  |`df.to_csv()`     |
| json          | `pd.read_json()` |`df.to_json()`    |
| excel         | `pd.read_excel()`|`df.to_excel()`   |
| hdf           | `pd.read_hdf()`  |`df.to_hdf()`     |
| sql           | `pd.read_sql()`  |`df.to_sql()`     |
| ...           |   ...          |       ...        |

7. Data types
    df.dtypes or print(df.dtypes)
8. Statistics
    dataframe.describe()  
    df.describe()  
    df.describe(include = "all")  
9. Statistics over specific columns  
    df[['length','compression-ratio']].describe()
10. Info
    dataframe.info(), e.g. df.info()

[Home](#home)

<a name="prep"></a>
# Pre-processing Data in Python

Data pre-processing is the process of converting or mapping data from one raw form into another format to make it ready for further analysis. It is often called data cleaning or data wrangling. The main steps are:

1. Identify and handle missing values. 
2. Data formatting: standardize the values into the same format, unit, or convention. 
3. Data normalization (centering and scaling).
4. Data binning.
5. Turning categorical values into numeric variables to make statistical modeling easier.

### Dealing with missing values
A missing value in a data set usually appears as a question mark, a zero, or just a blank cell.  
Strategies:
* check if the person or group that collected the data can go back and find what the actual value is
* Drop the missing value
    * drop the variable
    * drop the data entry
* Replace the missing value
    * replace it with an average
    * replace it by frequency
    * replace it based on other functions
* Leave it as missing data

To remove data that contains missing values, the pandas library has a built-in method called df.dropna(). You specify axis=0 to drop the <b>rows</b> or axis=1 to drop the <b>columns</b> that contain the missing values. By default the method returns a new data frame; to modify the data frame in place, set the parameter inplace=True.

To replace missing values like NaNs with actual values, the pandas library has a built-in method called replace, "dataframe.replace(missing_value, new_value)", which can be used to fill in the missing values with newly calculated values.
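
Both strategies can be sketched on a tiny, made-up data frame. The column names follow the automobile dataset used in these notes, but the values are invented, and assigning results back is used here instead of inplace=True.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "?" marks a missing value, as in the raw automobile file.
df = pd.DataFrame({
    "normalized-losses": ["164", "?", "158", "152"],
    "price": [13950.0, 16500.0, np.nan, 17710.0],
})

# Turn the "?" placeholder into a real missing value.
df = df.replace("?", np.nan)

# Replace missing normalized-losses with the column mean.
losses = df["normalized-losses"].astype("float")
df["normalized-losses"] = losses.replace(np.nan, losses.mean())

# Drop the rows whose price is missing, then renumber the rows.
df = df.dropna(subset=["price"], axis=0).reset_index(drop=True)

print(df)
```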

### Data formatting in Python
Data formatting means bringing data into a common standard of expression that allows users to make meaningful comparisons. As a part of dataset cleaning, data formatting ensures the data is consistent and easily understandable.
* Convert data -> df["city-mpg"] = 235/df["city-mpg"], then rename the column: df.rename(columns={"city-mpg": "city-L/100km"}, inplace=True)
* Convert data type -> df["price"]=df["price"].astype("int")

### Data normalisation
By making the ranges consistent between variables, normalization enables a fair comparison between the different features, making sure they have the same impact. It can also make some statistical analyses easier down the road.

Three techniques to normalise values:
1. <b>Simple feature scaling</b> just divides each value by the maximum value for that feature:  
    df["length"] = df["length"]/df["length"].max()
2. <b>min-max</b> takes each value X_old, subtracts the minimum value of that feature, then divides by the range of that feature:  
    df["length"] = (df["length"]-df["length"].min())/(df["length"].max()-df["length"].min())
3. <b>z-score or standard score</b>. For each value you subtract $\mu$, the average of the feature, and then divide by the standard deviation $\sigma$. The resulting values hover around zero:  
    df["length"] = (df["length"]-df["length"].mean())/df["length"].std()
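
The three techniques can be compared side by side on a small, made-up series of lengths. Note that pandas' `std()` computes the sample standard deviation by default.

```python
import pandas as pd

# Hypothetical feature values.
length = pd.Series([150.0, 175.0, 200.0])

# Simple feature scaling: values end up in (0, 1].
simple = length / length.max()

# Min-max: values end up in [0, 1].
min_max = (length - length.min()) / (length.max() - length.min())

# Z-score: values are centered on 0 with unit standard deviation.
z_score = (length - length.mean()) / length.std()

print(pd.DataFrame({"simple": simple, "min_max": min_max, "z_score": z_score}))
```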

### Binning in Python
Binning is when you group values together into bins. Binning can improve accuracy of the predictive models. In addition, sometimes we use data binning to group a set of numerical values into a smaller number of bins to have a better understanding of the data distribution. 

In Python we can easily implement binning. We would like 3 bins of equal width, so we need 4 numbers as dividers that are an equal distance apart. 
1. First we use the numpy function “linspace” to return the array “bins” that contains 4 equally spaced numbers over the specified interval of the price:  
bins=np.linspace(min(df["price"]),max(df["price"]),4)
2. We create a list “group_names” that contains the different bin names:  
group_names=["Low","Medium","High"]
3. We use the pandas function “cut” to segment and sort the data values into bins. You can then use histograms to visualize the distribution of the data after they’ve been divided into bins:  
df["price-binned"] = pd.cut(df["price"],bins,labels=group_names,include_lowest=True)
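
The three binning steps can be run end to end on a handful of invented prices:

```python
import numpy as np
import pandas as pd

# Hypothetical prices.
df = pd.DataFrame({"price": [5000, 9000, 15000, 22000, 30000, 45000]})

# 4 equally spaced dividers give 3 bins of equal width.
bins = np.linspace(df["price"].min(), df["price"].max(), 4)
group_names = ["Low", "Medium", "High"]

# include_lowest=True keeps the minimum price inside the first bin.
df["price-binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)

counts = df["price-binned"].value_counts()
print(df)
print(counts)
```

Equal-width bins do not guarantee equal counts per bin; with skewed data most rows can fall into the lowest bin.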

### Turning categorical variables into quantitative variables in Python
Most statistical models cannot take objects or strings as input; for model training they only take numbers. We encode the values by adding a new feature for each unique element in the original feature we would like to encode. 

In pandas, we can use the get_dummies method to convert categorical variables to dummy variables. The get_dummies method automatically generates one column of 0s and 1s for each category of the variable.  

pd.get_dummies(df['fuel'])
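
A short sketch of the encoding on a made-up `fuel` column. The `dtype=int` argument is an optional choice made here so the output shows 0 and 1; depending on the pandas version the default may be booleans instead.

```python
import pandas as pd

# Hypothetical categorical feature.
df = pd.DataFrame({"fuel": ["gas", "diesel", "gas"]})

# One indicator column per category, named after the category.
dummies = pd.get_dummies(df["fuel"], dtype=int)

# Attach the indicator columns to the original data frame.
df = pd.concat([df, dummies], axis=1)
print(df)
```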

<b>Exercise</b>
1. Import the libraries:  
    import pandas as pd  
    import numpy as np  
    import matplotlib.pylab as plt
2. Reading the dataset:  
    filename = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/auto.csv"  
    headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]  
    df = pd.read_csv(filename, names = headers)
3. See what the data looks like:
    df.head()
4. Identify and handle missing data
    df.replace("?", np.nan, inplace = True)
5. Evaluate for missing data
    missing_data = df.isnull()
    Count for missing values in each column:
        for column in missing_data.columns.values.tolist():  
            print(column)  
            print(missing_data[column].value_counts())  
            print("")  
6. Deal with missing data
    * drop data
    * replace data
    replace by mean:  
        avg_norm_loss = df["normalized-losses"].astype("float").mean(axis=0)  
        df["normalized-losses"] = df["normalized-losses"].replace(np.nan, avg_norm_loss)  
7. Correct data format
    List the data types: df.dtypes  
    Convert data types to proper format: df[["bore", "stroke"]] = df[["bore", "stroke"]].astype("float")  
8. Data standardisation  
    df["highway-mpg"] = 235/df["highway-mpg"]  
    df.rename(columns={'highway-mpg':'highway-L/100km'}, inplace=True)  
    df.head()  
9. Data normalisation  
    df['length'] = df['length']/df['length'].max()  
    df['width'] = df['width']/df['width'].max()  
    df['height'] = df['height']/df['height'].max()  
    df[["length","width","height"]].head()
10. Binning
    Convert horsepower to a numeric type and fill its missing values first, so that it can be binned:  
    df["horsepower"] = df["horsepower"].astype("float")  
    df["horsepower"] = df["horsepower"].fillna(df["horsepower"].mean())  
    
    Plot a histogram to see the best binning options:  
    %matplotlib inline  
    import matplotlib.pyplot as plt  
    plt.hist(df["horsepower"])  
    
    # set x/y labels and plot title  
    plt.xlabel("horsepower")  
    plt.ylabel("count")  
    plt.title("horsepower bins")  
    
    bins = np.linspace(df["horsepower"].min(), df["horsepower"].max(), 4)  
    bins  
    group_names = ['Low', 'Medium', 'High']  
    df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True )  
    df[['horsepower','horsepower-binned']].head(20)  
    df["horsepower-binned"].value_counts()  
    
    Plot the bins (reindex so the counts line up with the bin names):  
    plt.bar(group_names, df["horsepower-binned"].value_counts().reindex(group_names))  
    
    # set x/y labels and plot title  
    plt.xlabel("horsepower")  
    plt.ylabel("count")  
    plt.title("horsepower bins")
11. Indicator (=dummy) variable  
    dummy_variable_1 = pd.get_dummies(df["fuel-type"])  
    dummy_variable_1.rename(columns={'gas':'fuel-type-gas', 'diesel':'fuel-type-diesel'}, inplace=True)  
    dummy_variable_1.head()  
    
    df = pd.concat([df, dummy_variable_1], axis=1)  
    df.drop("fuel-type", axis = 1, inplace=True)  
    

[Home](#home)

<a name='eda'></a>
# Exploratory Data Analysis
Exploratory data analysis (EDA) is an approach to analyze data in order to 
* summarize main characteristics of the data, 
* gain better understanding of the data set, 
* uncover relationships between different variables, 
* extract important variables for the problem we're trying to solve.

EDA is about 
* <b>descriptive statistics</b>, which describe basic features of a data set and give a short summary of the sample and measures of the data
* <b>basics of grouping data</b> using GroupBy, and how this can help to transform our data set
* <b>ANOVA</b>, the analysis of variance, a statistical method in which the variation in a set of observations is divided into distinct components
* <b>correlation</b> between different variables
* <b>advanced correlation</b>, which introduces various statistical correlation methods, namely Pearson correlation and correlation heatmaps.

### Descriptive Statistics
Descriptive statistical analysis helps to describe basic features of a dataset and obtains a short summary about the sample and measures of the data. 

<b>df.describe()</b>  
It shows the mean, the total number of data points, the standard deviation, the quartiles, and the extreme values. Any NaN values are automatically skipped in these statistics.

<b>value_counts()</b>  
Categorical variables in your dataset can be divided up into different categories or groups and have discrete values. The value_counts() method summarizes such a variable by counting how often each category occurs.

<b>Box plots -- sns.boxplot (x,y,data)</b>
Box plots are a great way to visualize numeric data, since you can visualize the various distributions of the data. The main features that the box plot shows are the 
* median of the data, which represents where the middle data point is. 
* The upper quartile shows where the 75th percentile is. 
* The lower quartile shows where the 25th percentile is. 
* The data between the upper and lower quartile represents the interquartile range.
With box plots, you can easily spot outliers and also see the distribution and skewness of the data. Box plots make it easy to compare between groups

<b>Scatter plot -- plt.scatter(x,y)</b>
Each observation in the scatter plot is represented as a point. This plot shows the relationship between two variables. The predictor variable is the variable that you are using to predict an outcome. In this case, our predictor variable is the engine size. The target variable is the variable that you are trying to predict. In this case, our target variable is the price, since this would be the outcome. In a scatter plot, we typically set the predictor variable on the x-axis or horizontal axis, and we set the target variable on the y-axis or vertical axis.  

### GroupBy in Python
In pandas, grouping can be done using the group by method <b>df.groupby()</b>. The group by method 
* is used on categorical variables, 
* groups the data into subsets according to the different categories of that variable, 
* can group by a single variable or by multiple variables by passing in multiple variable names. 

#### Visualising the GroupBy data
1. Pivot tables:
    Transform this table to a pivot table by using the pivot method. This is similar to what is usually done in Excel spreadsheets:  
    df_pivot = df_grp.pivot(index='X', columns='Y')
2. Heat map plot:  
    Heat map takes a rectangular grid of data and assigns a color intensity based on the data value at the grid points. It is a great way to plot the target variable over <b>multiple variables</b> and through this get visual clues with the relationship between these variables and the target:  
    plt.pcolor(df_pivot, cmap='RdBu')  
    plt.colorbar()  
    plt.show()
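
The grouping and pivoting steps can be sketched without any plotting. The five rows below are invented; the column names follow the automobile dataset, and passing `values="price"` to `pivot` is a choice made here to get a flat column index.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd", "rwd"],
    "body-style": ["sedan", "hatchback", "sedan", "sedan", "hatchback"],
    "price": [9000.0, 8000.0, 21000.0, 23000.0, 14000.0],
})

# Mean price for every (drive-wheels, body-style) combination.
df_grp = df.groupby(["drive-wheels", "body-style"], as_index=False)["price"].mean()

# One row per drive-wheels value, one column per body-style value.
df_pivot = df_grp.pivot(index="drive-wheels", columns="body-style", values="price")

print(df_grp)
print(df_pivot)
```

A rectangular table like `df_pivot` is the kind of input a heat map function such as `plt.pcolor` expects.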
    
### Correlation
Correlation is a statistical metric for measuring to what extent different variables are interdependent. It is important to know that correlation doesn't imply causation.

We can use seaborn.regplot to create the scatter plot: sns.regplot(x='x-name',y='y-name',data=df), plt.ylim(0,)

### Correlation Statistics
One way to measure the strength of the correlation between continuous numerical variables is by using a method called <b>Pearson correlation</b>. The Pearson correlation method gives you two values: the correlation coefficient and the P-value. 

1. Correlation coefficient:
    * a value close to +1 implies a large positive correlation, 
    * a value close to -1 implies a large negative correlation, 
    * a value close to 0 implies no correlation between the variables. 
2. P-value will tell us how certain we are about the correlation that we calculated. 
    * a P-value < 0.001 gives us a <b>strong</b> certainty  
    * a P-value > 0.001 and <0.05 gives us <b>moderate</b> certainty. 
    * a P-value > 0.05 and <0.1 gives us a <b>weak</b> certainty. 
    * a P-value > 0.1 will give us <b>no</b> certainty of correlation at all.  

pearson_coef, p_value = stats.pearsonr(df['x-variable'],df['y-variable'])
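
As a rough illustration, the call above can be tried on two short, made-up arrays that are almost perfectly linearly related. The comparison against `np.corrcoef` is only a sanity check that both routes give the same coefficient.

```python
import numpy as np
from scipy import stats

# Hypothetical, nearly linear data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# pearsonr returns the correlation coefficient and the P-value.
pearson_coef, p_value = stats.pearsonr(x, y)

# The same coefficient taken from the correlation matrix.
manual_coef = np.corrcoef(x, y)[0, 1]

print("coefficient:", pearson_coef, "p-value:", p_value)
```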

-> Correlation heatmap
The color scheme indicates the Pearson correlation coefficient, indicating the strength of the correlation between two variables. We can see a diagonal line with a dark red color, indicating that all the values on this diagonal are highly correlated. This makes sense because when you look closer, the values on the diagonal are the correlation of all variables with themselves, which will be always 1. This correlation heatmap gives us a good overview of how the different variables are related to one another and, most importantly, how these variables are related to price.

### Analysis of Variance (ANOVA)
ANOVA lets us analyze a categorical variable and see the correlation among its different categories.
To analyze categorical variables such as the make variable, we can use the <b>ANOVA</b> method. ANOVA, which stands for Analysis of Variance, is a statistical test that can be used to find the correlation between different groups of a categorical variable.  
The ANOVA test returns two values, the F-test score and the p-value. 
1. The <b>F-test</b> score is the ratio of the variation between the group means to the variation within each of the sample groups. 
2. The <b>p-value</b> shows whether the obtained result is statistically significant. 
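
To make the ratio concrete, the sketch below computes a one-way ANOVA F score by hand with NumPy, using the usual between-group and within-group mean squares. The three groups and their prices (in thousands) are invented. In practice `scipy.stats.f_oneway` computes the same statistic and also returns the p-value.

```python
import numpy as np

# Hypothetical prices (in thousands) for three drive-wheel groups.
groups = {
    "fwd": np.array([9.0, 10.0, 11.0]),
    "rwd": np.array([19.0, 20.0, 21.0]),
    "4wd": np.array([10.0, 11.0, 12.0]),
}

all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()
k = len(groups)             # number of groups
n_total = all_values.size   # total number of observations

# Variation of the group means around the grand mean.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups.values())

# Variation of the observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())

# F = between-group mean square / within-group mean square.
f_val = (ss_between / (k - 1)) / (ss_within / (n_total - k))
print("F =", f_val)
```

A large F score arises here because the rwd group mean lies far from the other two while the spread inside each group is small.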

Exercise:
1. Import libraries
    import pandas as pd  
    import numpy as np  
2. Look at the data  
    path='https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/automobileEDA.csv'  
    df = pd.read_csv(path)  
    df.head()
3. Install seaborn  
    %%capture  
    ! pip install seaborn  
4. Import visualisation packages  
    import matplotlib.pyplot as plt  
    import seaborn as sns  
    %matplotlib inline 
5. list the data types  
    print(df.dtypes)
6. Correlation  
    df.corr(numeric_only=True) (needed in recent pandas versions when the data frame has non-numeric columns) or df[['x-value','y-value' ,'z-value','etc-value']].corr()
7. Scatter plot  
    sns.regplot(x="engine-size", y="price", data=df)  
    plt.ylim(0,)  
8. Categorical variables with boxplot  
    sns.boxplot(x="body-style", y="price", data=df)
9. Descriptive Statistics  
    df.describe() and with [Objects] df.describe(include=['object'])  
10. Value Counts  
    df['drive-wheels'].value_counts() or in a frame df['drive-wheels'].value_counts().to_frame()  
11. Grouping - what categories 
    df['drive-wheels'].unique()  
12. Create a group with different categories  
    df_group_one = df[['drive-wheels','body-style','price']]
13. Group multiple variables  
    df_gptest = df[['drive-wheels','body-style','price']]  
    grouped_test1 = df_gptest.groupby(['drive-wheels','body-style'],as_index=False).mean()  
    grouped_test1  
14. Group into a pivot  
    grouped_pivot = grouped_test1.pivot(index='drive-wheels',columns='body-style')  
    grouped_pivot  
15. Heatmap  
    plt.pcolor(grouped_pivot, cmap='RdBu')  
    plt.colorbar()  
    plt.show()
16. Heatmap - fine tuned  
    fig, ax = plt.subplots()  
    im = ax.pcolor(grouped_pivot, cmap='RdBu')  
    #label names  
    row_labels = grouped_pivot.columns.levels[1]  
    col_labels = grouped_pivot.index  
    #move ticks and labels to the center  
    ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)  
    ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)  
    #insert labels  
    ax.set_xticklabels(row_labels, minor=False)  
    ax.set_yticklabels(col_labels, minor=False)  
    #rotate label if too long  
    plt.xticks(rotation=90)  
    fig.colorbar(im)  
    plt.show()
17. Pearson-Correlation  
    from scipy import stats  
    pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])  
    print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)  
18. ANOVA  
    grouped_test2=df_gptest[['drive-wheels', 'price']].groupby(['drive-wheels'])  
    f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'], grouped_test2.get_group('4wd')['price'])  
    print( "ANOVA results: F=", f_val, ", P =", p_val) 

F-test score: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.

P-value: The P-value tells us how statistically significant our calculated score value is.

[Home](#home)

<a name='mod'></a>
# Model Development
A model or estimator can be thought of as a mathematical equation 
* used to predict a value given one or more other values, 
* relating one or more independent variables or features to dependent variables.

You'll learn about 
* simple and multiple linear regression, 
* model evaluation using visualization, 
* polynomial regression and pipelines, 
* R-squared and MSE for in-sample evaluation, 
* prediction and decision making, and how you can determine a fair value

### Linear Regression
Simple linear regression refers to using one independent variable to make a prediction. Multiple linear regression refers to using multiple independent variables to make a prediction. Simple linear regression (SLR) is a method to help us understand the relationship between two variables:
* The predictor independent variable $x$ and 
* the target dependent variable $y$.

$ y = b_0 + b_1 x$

$b_0$: the intercept  
$b_1$: the slope  

In order to determine the line, we take data points from our data set. We then use these training points to fit our model. The results of fitting the training points are the parameters. We usually store the data points in a data frame or numpy arrays. 
The value we would like to predict is called the target, which we store in the <b>array y</b>. We store the independent variable in the data frame or <b>array x</b>. Each sample corresponds to a different row in each data frame or array.  

We have a set of training points. We use these training points to fit or train the model and get parameters. We then use these parameters in the model. We now have a model. We use the $\hat{}$ on the $\hat{y}$ to denote the model is an estimate. We can use this model to predict values that we haven't seen.

$ \hat{y} = b_0 + b_1 x$  

To fit the model in Python, 
1. first we import LinearRegression from sklearn.linear_model  
    from sklearn.linear_model import LinearRegression
2. then create a linear regression object using the constructor  
    lm = LinearRegression()
3. Define the predictor and target variable
    x = df[['x-variables']] 
    y = df['y-variable']
4. Use the method fit to fit the model and find the parameters $b_0$ and $b_1$.  
    lm.fit(x,y)  
5. We can obtain a prediction using the method predict. The output is an array. The array has the same number of samples as the input x. The intercept $b_0$ is an attribute of the object lm (lm.intercept_). The slope $b_1$ is also an attribute of the object lm (lm.coef_).  
    yhat = lm.predict(x)  
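
To show what fitting computes, the sketch below derives $b_0$ and $b_1$ from the standard least-squares formulas with NumPy instead of scikit-learn. The data points are made up and lie exactly on the line $y = 1 + 2x$, so the recovered parameters are easy to check.

```python
import numpy as np

# Hypothetical training points on the line y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Least-squares slope: covariation of x and y over the variation of x.
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()

# Least-squares intercept: the line passes through the point of means.
b0 = y.mean() - b1 * x.mean()

# Predictions for the training inputs.
yhat = b0 + b1 * x
print("b0 =", b0, "b1 =", b1)
```

A fitted scikit-learn `LinearRegression` object exposes the corresponding values as `intercept_` and `coef_`.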

### Multiple Regression
Multiple linear regression (MLR) is used to explain the relationship between one continuous target y variable and two or more predictor x variables. If we have for example 4 predictor variables then 
* $b_0$: the intercept (the value of $\hat{y}$ when every $x$ is zero) 
* $b_1$: the coefficient or parameter of $x_1$
* $b_2$: the coefficient or parameter of $x_2$, and so on

$ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + ...$  

1. extract the predictor variables and store them in the variable Z  
    Z = df[['VarX1','VarX2','VarX3']]  
2. train the model using the method fit with the features or independent variables  
    lm.fit(Z, df['Y'])
3. Obtain a prediction  
    yhat = lm.predict(Z)  

### Model Evaluation Using Visualisation
Regression plots give a good estimate of the relationship between two variables, the strength of the correlation, and the direction of the relationship (positive or negative). The horizontal axis is the independent variable. The vertical axis is the dependent variable. Each point represents a different target point. The fitted line represents the predicted value. There are several ways to plot a regression plot.

<b>Regression plot</b>
1. Import <b>seaborn</b>  
    import seaborn as sns  
2. Then use the "regplot" function. The parameter $x$ is the name of the column that contains the independent variable or feature. The parameter $y$ is the name of the column that contains the dependent variable or target. The parameter data is the name of the dataframe. The result is given by the plot.  
    sns.regplot(x='xVar',y='yVar',data=df)
    plt.ylim(0,)  

<b>Residual plot</b>
The difference between the observed value of the dependent variable ($y$) and the predicted value ($ \hat{y}$) is called the residual (e). Each data point has one residual.  
    $e = y - \hat{y} $ 

For a least-squares fit that includes an intercept, both the <b>sum</b> and the <b>mean</b> of the residuals are equal to zero. That is, $\sum e = 0$ and $ \overline e = 0.$  
A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis. If the points in a <b>residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate</b> for the data; otherwise, a nonlinear model is more appropriate.

The residual plot represents the error between the actual value and the predicted value. Examining the predicted value and the actual value, we see a difference. We obtain the residual by subtracting the predicted value from the actual target value.  
1.  Import <b>seaborn</b>  
    import seaborn as sns  
2. The first parameter is a series of the independent variable or feature. The second parameter is a series of the dependent variable or target. If the residuals show a curvature, that suggests a nonlinear relationship.  
    sns.residplot(x=df['VarX1'], y=df['Target'])

<b>Distribution plot</b>  
How do we visualize a model for Multiple Linear Regression? One way to look at the fit of the model is by looking at the <b>distribution plot</b>: We can look at the distribution of the fitted values that result from the model and compare it to the distribution of the actual values.  
![Residual plot](https://static-01.hindawi.com/articles/jat/volume-2020/8953182/figures/8953182.fig.0015.svgz)

A distribution plot counts the predicted values versus the actual values. For each value on the horizontal axis, we count and plot the number of predicted points that are approximately equal to that value, and do the same for the actual values.  
The values of the targets and predicted values are continuous, whereas a histogram is for discrete values. Therefore, the plotting library converts them to a smoothed distribution. The vertical axis is scaled to make the area under the distribution equal to one.
    
### Polynomial Regression and Pipelines
Polynomial Regression is a form of linear regression in which the relationship between the independent variable $x$ and dependent variable $y$ is modeled as an $n^{th}$ degree polynomial. Polynomial regression is a special case of the general linear regression.  

This method is beneficial for describing curvilinear relationships which is what you get by squaring(^2) or setting higher order terms of the predictor variables in the model transforming the data.  

The model can be quadratic, which means that the predictor variable in the model is squared. We use brackets around the variable to show that the whole variable is raised to the exponent. 
* The model can be quadratic - $2^{nd}$ order  
    $ \hat{y} = b_0 + b_1 x_1 + b_2 (x_1)^2$  
* The model can be cubic - $3^{rd}$ order  
    $ \hat{y} = b_0 + b_1 x_1 + b_2 (x_1)^2 + b_3 (x_1)^3$ 
* There also exists <b>higher order polynomial regressions</b>  
    $ \hat{y} = b_0 + b_1 x_1 + b_2 (x_1)^2 + b_3 (x_1)^3 + ...$ 

In Python:  
    f = np.polyfit(x,y,3)  
    p = np.poly1d(f)  
    print(p)  
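
A runnable version of the polynomial fit above, on made-up points generated from a known cubic so that the fitted coefficients can be checked by eye:

```python
import numpy as np

# Hypothetical data generated from y = 1 + 2x - 0.5x^3.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x - 0.5 * x**3

# polyfit returns the coefficients, highest degree first.
f = np.polyfit(x, y, 3)

# poly1d wraps the coefficients in a callable polynomial.
p = np.poly1d(f)
print(p)

# Evaluate the fitted polynomial at a new point.
prediction = p(5.0)
print(prediction)
```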

    
### Pipline  
The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a ‘__’.  
Data Pipelines simplify the steps of processing the data. We use the module Pipeline to create a pipeline. We also use StandardScaler as a step in our pipeline.  
We create the pipeline, by creating a list of tuples including the name of the model or estimator and its corresponding constructor.  
    Input=[('scale',StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model',LinearRegression())]
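
The list of tuples above can be turned into a working pipeline as follows. This is a sketch under stated assumptions: the two-feature data set is synthetic, and the variable names `steps`, `Z` and `y` are chosen for illustration. It also shows how a parameter of one step is set through the `<step name>__<parameter name>` convention.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (assumption): two features and a target that is quadratic in them.
rng = np.random.default_rng(0)
Z = rng.uniform(-2.0, 2.0, size=(100, 2))
y = 3.0 + 2.0 * Z[:, 0] - Z[:, 1] + 0.5 * Z[:, 0] ** 2 + Z[:, 0] * Z[:, 1]

# Each step is a (name, estimator) tuple; the last step is the model.
steps = [
    ("scale", StandardScaler()),
    ("polynomial", PolynomialFeatures(include_bias=False)),
    ("model", LinearRegression()),
]
pipe = Pipeline(steps)

# "<step name>__<parameter name>" addresses a parameter of a single step.
pipe.set_params(polynomial__degree=2)

# fit() scales, expands to polynomial features and fits the regression in one call.
pipe.fit(Z, y)
ypipe = pipe.predict(Z)
r2 = pipe.score(Z, y)
print(ypipe[0:4], r2)
```

Because the target is an exact quadratic function of the features, the in-sample R^2 of this sketch should be very close to one.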

[Home](#home)

<a name='insamp'></a>
# Measures for In-Sample Evaluation
In-sample evaluation tells us how well our model fits the data already given to train it. It does not give us an estimate of how well the trained model can predict new data. 

We want to numerically evaluate our models. Let’s look at some of the measures that we use for in-sample evaluation. These measures are a way to numerically determine how well the model fits our data. Two important measures that we often use to determine the fit of a model are: 
* Mean Square Error (MSE)
* R-squared

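<b>MSE</b>  
The Mean Square Error is the average of the squared differences between the actual values and the predicted values.  
from sklearn.metrics import mean_squared_error  
mean_squared_error(df['yVar'], Y_predict_simple_fit)  # actual target values first, then the predictions  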

<b>$R^2$</b>  
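R-squared is also called the coefficient of determination. It is a measure of how close the data are to the fitted regression line.  
from sklearn.linear_model import LinearRegression  
lm = LinearRegression()  
X = df[['xVar']]  
Y = df['yVar']  
lm.fit(X,Y)  
lm.score(X,Y)  # returns the R-squared of the fit  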
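
To make the two measures concrete, the sketch below computes them by hand and compares the results with scikit-learn. It assumes a synthetic data set with one feature; the names `mse_manual` and `r2_manual` are introduced only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data (assumption): a noisy straight line.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 10.0, size=(80, 1))
Y = 4.0 - 1.5 * X[:, 0] + rng.normal(scale=1.0, size=80)

lm = LinearRegression()
lm.fit(X, Y)
Yhat = lm.predict(X)

# MSE: average of the squared differences between actual and predicted values.
mse_manual = np.mean((Y - Yhat) ** 2)

# R^2: one minus the ratio of the residual sum of squares
# to the total sum of squares around the mean of Y.
ss_res = np.sum((Y - Yhat) ** 2)
ss_tot = np.sum((Y - Y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(Y, Yhat))
print(r2_manual, lm.score(X, Y))
```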

# Prediction and Decision Making
How can we determine if our model is correct? The first thing you should do is make sure your model results make sense. You should always use:  
* visualization  
* numerical measures for evaluation and  
* comparison between different models  

<b>Exercise (SLR)</b>  
1. Import libraries  
    import pandas as pd  
    import numpy as np  
    import matplotlib.pyplot as plt  
2. Build dataframe  
    path = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/automobileEDA.csv'  
    df = pd.read_csv(path)  
    df.head()
3. Load module for linear regression
    from sklearn.linear_model import LinearRegression
4. Create linear regression object
    lm = LinearRegression()  
    lm  
5. Define the variables  
    X = df[['highway-mpg']]  
    Y = df['price']
6. Fit the linear model  
    lm.fit(X,Y)
7. Output prediction  
    Yhat=lm.predict(X)  
    Yhat[0:5]   
8. Intercept and slope  
    lm.intercept_  
    lm.coef_  

<b>Exercise (MLR)</b>  
1. Define the variable  
    Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]  
2. Fit the model  
    lm.fit(Z, df['price'])
3. Intercept and values of the coefficients b1 ... bn  
    lm.intercept_  
    lm.coef_  
4. Visualisation  
    import seaborn as sns  
      
    %matplotlib inline  
    width = 12  
    height = 10  
    plt.figure(figsize=(width, height))  
    sns.regplot(x="highway-mpg", y="price", data=df)  
    plt.ylim(0,)  

<b>Residual plot</b>   
width = 12  
height = 10  
plt.figure(figsize=(width, height))  
sns.residplot(x=df['highway-mpg'], y=df['price'])  
plt.show()  

What do we pay attention to when looking at a residual plot? We look at the spread of the residuals: if the points in a residual plot are randomly spread out around the x-axis, then a linear model is appropriate for the data. Why is that? Randomly spread out residuals mean that the variance is constant and that no systematic pattern is left unexplained, and thus the linear model is a good fit for this data.
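
The same idea can be checked numerically without a plot. The sketch below assumes synthetic data with a curved relationship (chosen for illustration) and fits a straight line to it. The residuals average to zero, as they do for any least-squares line with an intercept, but they remain strongly related to the squared distance from the centre of x, which is the curvature a residual plot would reveal.

```python
import numpy as np

# Synthetic curved data (assumption): y grows with x squared, plus noise.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 50)
y = x**2 + rng.normal(scale=1.0, size=x.size)

# Fit a straight line and compute the residuals (actual minus predicted).
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# A least-squares line with an intercept leaves residuals that average to zero.
mean_residual = residuals.mean()

# A high correlation with (x - mean)^2 means the residuals follow a U shape,
# so they are not randomly spread and a linear model is not appropriate here.
curvature_corr = np.corrcoef(residuals, (x - x.mean()) ** 2)[0, 1]
print(mean_residual, curvature_corr)
```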

<b>Distribution plot</b>  
Y_hat = lm.predict(Z)  
plt.figure(figsize=(width, height))  
  
ax1 = sns.distplot(df['price'], hist=False, color="r", label="Actual Value")  
sns.distplot(Y_hat, hist=False, color="b", label="Fitted Values" , ax=ax1)  
  
plt.title('Actual vs Fitted Values for Price')  
plt.xlabel('Price (in dollars)')  
plt.ylabel('Proportion of Cars')  
  
plt.show()  
plt.close()  

<b>Polynomial fit</b>  
def PlotPolly(model, independent_variable, dependent_variable, Name):  
    x_new = np.linspace(15, 55, 100)  
    y_new = model(x_new)  

    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')  
    plt.title('Polynomial Fit with Matplotlib for Price ~ Length')  
    ax = plt.gca()  
    ax.set_facecolor((0.898, 0.898, 0.898))  
    fig = plt.gcf()  
    plt.xlabel(Name)  
    plt.ylabel('Price of Cars')  
  
    plt.show()  
    plt.close()  
    
x = df['highway-mpg']
y = df['price']
  
f = np.polyfit(x, y, 3)  
p = np.poly1d(f)  
print(p)  

PlotPolly(p, x, y, 'highway-mpg')  

np.polyfit(x, y, 3)  

<b>Pipeline</b>  
from sklearn.pipeline import Pipeline  
from sklearn.preprocessing import StandardScaler, PolynomialFeatures  

Input=[('scale',StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model',LinearRegression())]  

pipe=Pipeline(Input)  
pipe  

pipe.fit(Z,y)  

ypipe=pipe.predict(Z)  
ypipe[0:4]  

<b>Measures for In-Sample Evaluation</b>  
Two very important measures that are often used in Statistics to determine the accuracy of a model are:  

* R^2 / R-squared
* Mean Squared Error (MSE)


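### Find the R^2
lm.fit(X, Y)  
print('The R-square is: ', lm.score(X, Y))  

### MSE
from sklearn.metrics import mean_squared_error  
Yhat=lm.predict(X)  
print('The output of the first four predicted value is: ', Yhat[0:4])  
mse = mean_squared_error(Y, Yhat)  
print('The mean square error of price and predicted value is: ', mse)  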

### Decision Making: Determining a Good Model Fit  
Now that we have visualized the different models, and generated the R-squared and MSE values for the fits, how do we determine a good model fit?

#### What is a good R-squared value?
When comparing models, the model with the higher R-squared value is a better fit for the data.

#### What is a good MSE?
When comparing models, the model with the smallest MSE value is a better fit for the data.

Let's take a look at the values for the different models.

* Simple Linear Regression: Using Highway-mpg as a Predictor Variable of Price.
    * R-squared: 0.49659118843391759
    * MSE: 3.16 x 10^7
* Multiple Linear Regression: Using Horsepower, Curb-weight, Engine-size, and Highway-mpg as Predictor Variables of Price.
    * R-squared: 0.80896354913783497
    * MSE: 1.2 x 10^7
* Polynomial Fit: Using Highway-mpg as a Predictor Variable of Price.
    * R-squared: 0.6741946663906514
    * MSE: 2.05 x 10^7

#### Simple Linear Regression model (SLR) vs Multiple Linear Regression model (MLR)
Usually, the more variables you have, the better your model is at predicting, but this is not always true. Sometimes you may not have enough data, you may run into numerical problems, or many of the variables may not be useful or may even act as noise. As a result, you should always check the MSE and R^2.

To compare the results of the MLR and SLR models, we look at both the R-squared and the MSE to draw a conclusion about the fit of the model.

* MSE: The MSE of the SLR is 3.16 x 10^7, while the MLR has an MSE of 1.2 x 10^7. The MSE of the MLR is much smaller.
* R-squared: There is also a big difference between the R-squared of the SLR and the R-squared of the MLR. The R-squared for the SLR (~0.497) is very small compared to the R-squared for the MLR (~0.809).

This R-squared in combination with the MSE shows that the MLR seems like the better model fit in this case, compared to the SLR.

#### Simple Linear Model (SLR) vs Polynomial Fit
* MSE: We can see that the Polynomial Fit brought down the MSE, since this MSE is smaller than the one from the SLR.
* R-squared: The R-squared for the Polynomial Fit is larger than the R-squared for the SLR, so the Polynomial Fit also brought up the R-squared quite a bit.

Since the Polynomial Fit resulted in a lower MSE and a higher R-squared, we can conclude that this was a better fit model than the simple linear regression for predicting Price with Highway-mpg as a predictor variable.

#### Multiple Linear Regression (MLR) vs Polynomial Fit
* MSE: The MSE for the MLR is smaller than the MSE for the Polynomial Fit.
* R-squared: The R-squared for the MLR is also much larger than for the Polynomial Fit.
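
The comparison rule (highest R-squared, smallest MSE) can be written down in a few lines. The sketch below only reuses the values reported in this section; the dictionary layout and the variable names are illustrative assumptions.

```python
# R-squared and MSE values as reported in this section.
results = {
    "SLR": {"r2": 0.49659118843391759, "mse": 3.16e7},
    "MLR": {"r2": 0.80896354913783497, "mse": 1.2e7},
    "Polynomial Fit": {"r2": 0.6741946663906514, "mse": 2.05e7},
}

# The better fit has the higher R-squared and the smaller MSE.
best_by_r2 = max(results, key=lambda name: results[name]["r2"])
best_by_mse = min(results, key=lambda name: results[name]["mse"])

print(best_by_r2, best_by_mse)  # MLR MLR
```

Both criteria point to the same model here. When they disagree, the two measures should be weighed together rather than trusting one of them alone.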

[Home](#home)

<a name='eva'></a>
# Model Evaluation
Model evaluation tells us how our model performs in the real world.

Separating data into training and testing sets is an important part of model evaluation. 

We use a training set to build a model and discover predictive relationships. We then use a testing set to evaluate model performance. When we have completed testing our model, we should use all the data to train the model. 

A popular function in the scikit-learn package for splitting datasets is train_test_split. This function randomly splits a dataset into training and testing subsets. 

The function is imported from sklearn.model_selection. The input parameter y_data is the target variable; in the car appraisal example, it would be the price. The input parameter x_data is the list of predictor variables; in this case, it would be all the other variables in the car dataset that we are using to try to predict the price. The parameter test_size sets the proportion of the data held out for testing, and random_state seeds the random split. The output consists of four arrays: x_train and y_train are the subsets for training, and x_test and y_test are the subsets for testing.
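
A minimal sketch of the split, assuming synthetic arrays in place of the car dataset (the sizes, the seed and the `test_size` value are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for x_data and y_data (assumption): 200 samples, 3 features.
rng = np.random.default_rng(0)
x_data = rng.normal(size=(200, 3))
y_data = x_data @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Hold out 25% of the samples for testing; random_state makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.25, random_state=1
)

print("number of test samples:", x_test.shape[0])
print("number of training samples:", x_train.shape[0])
```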

One of the most common out-of-sample evaluation methods is cross-validation. In this method, the dataset is split into K equal groups. Each group is referred to as a fold; for example, four folds. Some of the folds are used as a training set, which we use to train the model, and the remaining fold is used as a test set, which we use to test the model. This is repeated until each fold has been used for testing. The cross_val_score function performs this procedure and returns one score per fold.

The cross_val_predict function takes exactly the same input parameters as the cross_val_score function, but its output is the prediction obtained for each sample while it was part of the test fold.
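
The two functions can be compared side by side. The sketch below assumes a synthetic one-feature data set (the numbers are made up) and uses four folds. With the default scoring, a regression estimator is scored with R^2, while `scoring="neg_mean_squared_error"` returns the negated MSE of each fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict, cross_val_score

# Synthetic data (assumption): one feature with a noisy linear relationship.
rng = np.random.default_rng(0)
x_data = rng.uniform(50.0, 250.0, size=(120, 1))
y_data = 100.0 * x_data[:, 0] + rng.normal(scale=2000.0, size=120)

lre = LinearRegression()

# One R^2 score per fold.
Rcross = cross_val_score(lre, x_data, y_data, cv=4)
print("mean:", Rcross.mean(), "standard deviation:", Rcross.std())

# The scorer returns negative MSE values, so multiply by -1 to get the MSE.
mse_folds = -1 * cross_val_score(
    lre, x_data, y_data, cv=4, scoring="neg_mean_squared_error"
)

# One prediction per sample, made while that sample was in the test fold.
yhat = cross_val_predict(lre, x_data, y_data, cv=4)
print(yhat[0:5])
```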

## Overfitting and Underfitting
The goal of Model Selection is to determine the order of the polynomial to provide the best estimate of the function y(x).

* Underfitting: where the model is too simple to fit the data. 
* Overfitting: where the model is too flexible and fits the noise rather than the function. 

Consider a plot of the mean square error for the training and testing sets for polynomials of different order. The horizontal axis represents the order of the polynomial, and the vertical axis is the mean square error. The training error decreases with the order of the polynomial. The test error is a better means of estimating the error of a polynomial: it decreases until the best order of the polynomial is reached, and then it begins to increase. We select the order that minimizes the test error.
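
The selection rule can be sketched in code. The example assumes synthetic data generated from a cubic function plus noise, a fixed 60/40 split and orders 1 to 6; all of these are illustrative choices. It records the training and test MSE for each order and picks the order with the smallest test MSE.

```python
import numpy as np

# Synthetic data (assumption): a cubic "true" function plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = x**3 - x + rng.normal(scale=0.05, size=100)

# Fixed split: first 60 samples for training, remaining 40 for testing.
x_train, y_train = x[:60], y[:60]
x_test, y_test = x[60:], y[60:]

orders = [1, 2, 3, 4, 5, 6]
train_mse, test_mse = [], []
for n in orders:
    # Fit a polynomial of order n on the training data only.
    p = np.poly1d(np.polyfit(x_train, y_train, n))
    train_mse.append(np.mean((y_train - p(x_train)) ** 2))
    test_mse.append(np.mean((y_test - p(x_test)) ** 2))

# The training error never increases with the order; the test error decides.
best_order = orders[int(np.argmin(test_mse))]
print("training MSE:", train_mse)
print("test MSE:", test_mse)
print("selected order:", best_order)
```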

## Ridge Regression
Ridge regression helps prevent overfitting. We will focus on polynomial regression for visualization, but overfitting is also a big problem when you have multiple independent variables, or features.

Ridge regression controls the magnitude of these polynomial coefficients by introducing the parameter alpha. Alpha is a parameter we select before fitting or training the model. Each row in the following table represents an increasing value of alpha.

To make a prediction using ridge regression, import Ridge from sklearn.linear_model. Create a Ridge object using the constructor; the parameter alpha is one of the arguments of the constructor. We train the model using the fit method. To make a prediction, we use the predict method. To determine the parameter alpha, we use some data for training and a separate set of validation data to compare the models obtained with different values of alpha.
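
The following sketch shows these steps and the effect of alpha on the coefficients. It assumes synthetic data, polynomial features of degree 5 and three arbitrary values of alpha; the name `coef_norms` is introduced only for this illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): a cubic function plus noise, one feature.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(60, 1))
y = x[:, 0] ** 3 - x[:, 0] + rng.normal(scale=0.05, size=60)

# Polynomial features of degree 5 give the model room to overfit.
x_pr = PolynomialFeatures(degree=5, include_bias=False).fit_transform(x)

alphas = [0.01, 1.0, 100.0]
coef_norms = []
for alpha in alphas:
    RidgeModel = Ridge(alpha=alpha)    # alpha is set in the constructor
    RidgeModel.fit(x_pr, y)            # train the model
    yhat = RidgeModel.predict(x_pr)    # make a prediction
    # Record the overall size of the coefficients for this alpha.
    coef_norms.append(float(np.linalg.norm(RidgeModel.coef_)))

print(dict(zip(alphas, coef_norms)))
```

A larger alpha shrinks the coefficients more strongly, so the norms printed above decrease as alpha increases.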

## Grid Search
Parameters such as alpha, which are chosen before fitting rather than learned from the data, are called hyperparameters. Scikit-learn has a means of automatically iterating over these hyperparameters using cross-validation. This method is called Grid Search. Grid Search takes the model or objects you would like to train and different values of the hyperparameters. It then calculates the mean square error or R-squared for various hyperparameter values, allowing you to choose the best values.
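
A minimal sketch of a grid search over alpha for ridge regression, assuming synthetic data with four features and an arbitrary list of candidate values (both are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption): four features, one of which is irrelevant.
rng = np.random.default_rng(0)
x_data = rng.normal(size=(100, 4))
y_data = x_data @ np.array([3.0, -1.0, 0.0, 2.0]) + rng.normal(scale=0.5, size=100)

# Candidate hyperparameter values, given as a list of dictionaries.
parameters = [{"alpha": [0.001, 0.1, 1, 10, 100, 1000]}]

# Each candidate is evaluated with 4-fold cross-validation (R^2 by default).
Grid = GridSearchCV(Ridge(), parameters, cv=4)
Grid.fit(x_data, y_data)

# The estimator refitted on all the data with the best value of alpha.
BestRR = Grid.best_estimator_
mean_scores = Grid.cv_results_["mean_test_score"]
best_score = BestRR.score(x_data, y_data)
print(Grid.best_params_, best_score)
```

The attribute `cv_results_` holds the mean score of every candidate, which is useful when several values of alpha perform almost equally well.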

<b>Exercise</b>
1. Prep work
    import pandas as pd  
    import numpy as np
2. Import clean data  
    path = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/module_5_auto.csv'  
    df = pd.read_csv(path)  
    df.to_csv('module_5_auto.csv')
    df = df.select_dtypes(include='number')  # keep only the numeric columns  
    df.head()  
3. Install libraries for plotting
    %%capture  
    ! pip install ipywidgets
    
    from IPython.display import display  
    from ipywidgets import widgets, interact, interactive, fixed, interact_manual
4. Defining functions for plotting  
    def DistributionPlot(RedFunction, BlueFunction, RedName, BlueName, Title):  
        width = 12  
        height = 10  
        plt.figure(figsize=(width, height))  
        ax1 = sns.distplot(RedFunction, hist=False, color="r", label=RedName)  
        ax2 = sns.distplot(BlueFunction, hist=False, color="b", label=BlueName, ax=ax1)  
        plt.title(Title)  
        plt.xlabel('Price (in dollars)')  
        plt.ylabel('Proportion of Cars')  
        plt.show()  
        plt.close()  
    
    def PollyPlot(xtrain, xtest, y_train, y_test, lr,poly_transform):  
        width = 12  
        height = 10  
        plt.figure(figsize=(width, height))  
        # xtrain, y_train: training data  
        # xtest, y_test: testing data  
        # lr:  linear regression object  
        # poly_transform:  polynomial transformation object  
        xmax=max([xtrain.values.max(), xtest.values.max()])  
        xmin=min([xtrain.values.min(), xtest.values.min()])  
        x=np.arange(xmin, xmax, 0.1)  
        plt.plot(xtrain, y_train, 'ro', label='Training Data')  
        plt.plot(xtest, y_test, 'go', label='Test Data')  
        plt.plot(x, lr.predict(poly_transform.fit_transform(x.reshape(-1, 1))), label='Predicted Function')  
        plt.ylim([-10000, 60000])  
        plt.ylabel('Price')  
        plt.legend()
5. Training and Testing
    Split into training and testing data
        y_data = df['price']  
        x_data=df.drop('price',axis=1)  
        
        from sklearn.model_selection import train_test_split  
        x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.15, random_state=1)  
        print("number of test samples :", x_test.shape[0])  
        print("number of training samples:",x_train.shape[0])
6. Import the LinearRegression
    from sklearn.linear_model import LinearRegression
    lre=LinearRegression() #create linear regression object
    lre.fit(x_train[['horsepower']], y_train) # fit the model
7. Calculate the R^2  
    lre.score(x_test[['horsepower']], y_test)  #on the test data
    lre.score(x_train[['horsepower']], y_train)  # on the training data

<b>Cross-validation Score</b>
from sklearn.model_selection import cross_val_score  
Rcross = cross_val_score(lre, x_data[['horsepower']], y_data, cv=4)  
print("The mean of the folds is", Rcross.mean(), "and the standard deviation is" , Rcross.std())  
-1 * cross_val_score(lre,x_data[['horsepower']], y_data,cv=4,scoring='neg_mean_squared_error')  

Using cross_val_predict  
from sklearn.model_selection import cross_val_predict  
yhat = cross_val_predict(lre,x_data[['horsepower']], y_data,cv=4)  
yhat[0:5]  

<b>Overfitting, Underfitting and Model selection</b>  
1. Create a multiple linear regression object   
    lr = LinearRegression()  
    lr.fit(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_train)  
2. Predict using training data  
    yhat_train = lr.predict(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])  
    yhat_train[0:5]
3. Import libraries  
    import matplotlib.pyplot as plt  
    %matplotlib inline  
    import seaborn as sns  
4. Examine the distribution from the training set: 
    Title = 'Distribution  Plot of  Predicted Value Using Training Data vs Training Data Distribution'  
    DistributionPlot(y_train, yhat_train, "Actual Values (Train)", "Predicted Values (Train)", Title)
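5. Predict using the testing data and examine the distribution from the testing set: 
    yhat_test = lr.predict(x_test[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])  
    Title = 'Distribution  Plot of  Predicted Value Using Test Data vs Data Distribution of Test Data'  
    DistributionPlot(y_test, yhat_test, "Actual Values (Test)", "Predicted Values (Test)", Title)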
6. Overfitting  
    from sklearn.preprocessing import PolynomialFeatures  
    x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.45, random_state=0)  
    pr = PolynomialFeatures(degree=5)  
    x_train_pr = pr.fit_transform(x_train[['horsepower']])  
    x_test_pr = pr.transform(x_test[['horsepower']])  # reuse the transformer fitted on the training data  
    pr  
    poly = LinearRegression()  
    poly.fit(x_train_pr, y_train)  
    yhat = poly.predict(x_test_pr)  
    yhat[0:5]  
    print("Predicted values:", yhat[0:4])  
    print("True values:", y_test[0:4].values)  
    PollyPlot(x_train[['horsepower']], x_test[['horsepower']], y_train, y_test, poly,pr)  
    poly.score(x_train_pr, y_train) # .557  
    poly.score(x_test_pr, y_test)  # R^2 on the test data  
    # The lower the R^2, the worse the model; a negative R^2 on the test data is a sign of overfitting.
      
    # R^2 on the test data for polynomials of different order  
    Rsqu_test = []  
    order = [1, 2, 3, 4]  
    
    for n in order:  
        pr = PolynomialFeatures(degree=n)  
        x_train_pr = pr.fit_transform(x_train[['horsepower']])  
        x_test_pr = pr.transform(x_test[['horsepower']])  
        lr.fit(x_train_pr, y_train)  
        Rsqu_test.append(lr.score(x_test_pr, y_test))
    
    # plot once, after the loop has collected a score for every order  
    plt.plot(order, Rsqu_test)  
    plt.xlabel('order')  
    plt.ylabel('R^2')  
    plt.title('R^2 Using Test Data')  
    plt.text(3, 0.75, 'Maximum R^2 ')

<b>Ridge regression</b>


<b>Grid Search</b>


[Home](#home)