* [1. INTRODUCTION](#1)
* [2. USED LIBRARIES](#2)
* [3. DATA EXPLORATION](#3)
    * [3.1. Detailed Information of the Dataset](#31)
    * [3.2. Correlation of Columns(Attributes)](#32)
    * [3.3. Various Visualizations from Dataset](#33)
* [4. DATA PREPARATION AND CLEANING](#4)
    * [4.1. Drop Irrelevant Columns](#41)
    * [4.2. Correction of Column(Attribute) Names](#42)
    * [4.3. Split Data and Target](#43)
    * [4.4. Handling Missing Values](#44)
    * [4.5. Type Conversions and Encoding](#45)
    * [4.6. Normalization of Data](#46)
    * [4.7. Preparation of Test and Train Data](#47)
* [5. BUILDING MODELS](#5)
    * [5.1. Multiple Linear Regression](#51)
    * [5.2. Polynomial Regression](#52)
    * [5.3. Decision Tree Regression](#53)
    * [5.4. Random Forest Regression](#54)
    * [5.5. Support Vector Regression](#55)
* [6. EVALUATING MODELS](#6)
    * [6.1. Evaluating Multiple Linear Regression Model](#61)
    * [6.2. Evaluating Polynomial Regression Model](#62)
    * [6.3. Evaluating Decision Tree Regression Model](#63)
    * [6.4. Evaluating Random Forest Regression Model](#64)
    * [6.5. Evaluating Support Vector Regression Model](#65)
* [7. EXPLORATION OF RESULTS](#7)
* [8. CONCLUSION](#8)

<a id="1"></a> <br>
## 1. INTRODUCTION

In this study, I will give an example of EDA (Exploratory Data Analysis) and compare several regression algorithms in Machine Learning. I will use the "Medical Cost Personal Datasets" dataset, which contains anonymized information about 1338 people together with the annual insurance premiums charged to them. We will create and test regression models that estimate these annual premiums from the other information. Then we will discuss the results and see which algorithm is the most successful. We will also visualize various parts of the dataset.

<a id="2"></a> <br>
## 2. USED LIBRARIES

This section introduces the Python libraries used in the study and imports them into the project. Here are the libraries and a short explanation of each:

* **NumPy :** This library is a dependency of the other libraries. Its main purpose is to provide mathematical operations on matrices and vectors in Python. In our project, it mainly supports the other libraries.
* **Pandas :** This library imports and processes datasets in Python. In our project, it will be used to load the CSV dataset and to perform various operations on it.
* **Matplotlib :** This library is commonly used to visualize data, and it will perform the same task in our project.
* **Seaborn :** This library, which has similar features to Matplotlib, is another library used for data visualization in Python. In our project, it will be used for plots that are not readily available in Matplotlib.
* **Scikit-Learn :** This library implements various machine learning algorithms. We will use its functions and classes for every step from building to evaluating the regression models.

Now let's import NumPy, Pandas, Matplotlib and Seaborn libraries into our project and get them ready for use:

In [None]:
import numpy as np  # Importing NumPy library
import pandas as pd  # Importing Pandas library
import matplotlib.pyplot as plt  # Importing Matplotlib library's "pyplot" module
import seaborn as sns  # Importing Seaborn library

# These lines are for Kaggle:
import os
print(os.listdir("../input"))

<a id="3"></a> <br>
## 3. DATA EXPLORATION

In this section, we will examine the dataset in detail.

<a id="31"></a> <br>
### 3.1. Detailed Information of the Dataset

Here we will import the dataset first. Then, we will explain the columns(features) of the dataset one by one.

In [None]:
data = pd.read_csv("../input/insurance.csv")  # Read CSV file and load into "data" variable
data.info()  # Show detailed information for dataset columns(attributes)

As you can see from the output, there are 1338 rows (records) and 7 columns (attributes). Fortunately, our dataset doesn't have any missing values: every column of every row is filled with data. The rows are indexed from 0 to 1337. Now, let's explain what the columns mean:

* **age :** Indicates the age of the person. It contains data of type "_int64_".
* **sex :** It refers to the gender of the person. It contains "_object_" type data.
* **bmi :** It refers to the Body Mass Index of the person and contains data of type "_float64_". BMI is a person's weight divided by the square of their height, and it is used to judge whether a person is underweight, of normal weight, overweight or obese. The formula for US and metric units is as follows:
![](https://i2.wp.com/www.marathonnewbie.com/wp-content/uploads/2016/07/body-mass-index-formula.jpg)
* **children :** It refers to the number of children that a person has. It contains data of type "_int64_".
* **smoker :** Indicates whether the person smokes or not. It contains "_object_" type data.
* **region :** Specifies which region the person is from. It contains "_object_" type data.
* **charges :** The person's total insurance premium. Although the unit is not specified, it is assumed to be in dollars ($). It contains data of type "_float64_".
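
Since the formula image may not render everywhere, here is a minimal sketch of the BMI calculation in both unit systems. The function names `bmi_metric` and `bmi_us` and the sample person are made up for illustration; 703 is the usual conversion factor for pounds and inches.

```python
def bmi_metric(weight_kg, height_m):
    """BMI from kilograms and metres: weight / height^2."""
    return weight_kg / height_m ** 2


def bmi_us(weight_lb, height_in):
    """BMI from pounds and inches: 703 * weight / height^2."""
    return 703 * weight_lb / height_in ** 2


# Hypothetical person: 70 kg and 1.75 m tall
print(round(bmi_metric(70, 1.75), 2))

# Roughly the same person in US units (about 154.3 lb and 68.9 in)
print(round(bmi_us(154.3, 68.9), 2))
```

Both calls should give nearly the same value, because the two formulas differ only in the unit conversion.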

We have covered the necessary information about the dataset. Now let's look at the first 5 and last 5 entries to see what values it holds:

In [None]:
data.head()  # Prints first 5 entries of the dataset

In [None]:
data.tail()  # Prints last 5 entries of the dataset

There is no problem with the numeric data. However, the categorical columns of type "object" will need to be converted to numbers later. Now let's see various statistics about the numeric data:

In [None]:
data.describe()  # Print a table with summary statistics of the numeric columns

<a id="32"></a> <br>
### 3.2. Correlation of Columns(Attributes)

In this section, we'll compute the correlation matrix of the columns and visualize it as a heatmap. In this way, we will be able to see the relationships between the attributes more clearly.

In [None]:
data.select_dtypes(include="number").corr()  # Prints correlation matrix for the numerical columns only (text columns would raise an error in recent Pandas versions)

Correlation is a number that indicates how two attributes are related to each other. As this number approaches 1.0, the relationship gets stronger in the same direction: both values tend to rise together. As it approaches -1.0, the relationship gets stronger in the opposite direction: one value tends to fall as the other rises. If the value is close to zero, the linear relationship between the two attributes is weak. For example, in the matrix above we see a modest relationship between a person's age and charges. The other relationships are even weaker. Now let's visualize this correlation matrix with a heatmap:

In [None]:
fig, axes = plt.subplots(figsize=(8, 8))  # This method creates a figure and a set of subplots
sns.heatmap(data=data.select_dtypes(include="number").corr(), annot=True, linewidths=.5, ax=axes)  # Draw the heatmap of the numeric columns
# Parameters:
# data : 2D data for the heatmap.
# annot : If True, write the data value in each cell.
# linewidths : Width of the lines that will divide each cell.
# ax : Axes in which to draw the plot, otherwise use the currently-active Axes.
plt.show()  # Shows only the plot and hides the text output of the plotting call
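
To make the meaning of the correlation coefficient more concrete, here is a minimal sketch that computes the Pearson correlation by hand. The helper `pearson` and the toy values are made up for illustration and are not taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two sequences and the spread of each one
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy values (not from the dataset)
ages = [20, 30, 40, 50, 60]
rising = [2000, 4000, 6000, 8000, 10000]  # grows together with age
falling = [10, 8, 6, 4, 2]                # shrinks as age grows

print(pearson(ages, rising))   # close to +1
print(pearson(ages, falling))  # close to -1
```

A perfectly rising line gives a value at the positive end of the scale and a perfectly falling line gives a value at the negative end. Real data, like the columns in the heatmap, falls somewhere in between.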

<a id="33"></a> <br>
### 3.3. Various Visualizations from Dataset

In this section, we will make various visualizations of the dataset to understand the connections in the data before moving on to the next section. Let's look at the distribution of the numerical attributes:

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 10))
data.plot(kind="hist", y="age", bins=70, color="b", ax=axes[0][0])
data.plot(kind="hist", y="bmi", bins=200, color="r", ax=axes[0][1])
data.plot(kind="hist", y="children", bins=5, color="g", ax=axes[1][0])
data.plot(kind="hist", y="charges", bins=200, color="orange", ax=axes[1][1])
plt.show()

We can make inferences from these plots. For example, people who have no children outnumber every other group in the dataset. In addition, the charge is usually below 20000. Now let's look at the number of males and females in the dataset:

In [None]:
sns.catplot(x="sex", kind="count", palette="Set1", data=data)

Since the value we will examine as an output here is "_charges_", we need to examine the relationship of the other columns with it. To do this, let's draw Scatter Plots between the numeric columns and the "_charges_" column:

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(18, 5))
data.plot(kind='scatter', x='age', y='charges', alpha=0.5, color='green', ax=axes[0], title="Age vs. Charges")
data.plot(kind='scatter', x='bmi', y='charges', alpha=0.5, color='red', ax=axes[1], title="BMI vs. Charges")
data.plot(kind='scatter', x='children', y='charges', alpha=0.5, color='blue', ax=axes[2], title="Children vs. Charges")
plt.show()

Finally, look at the distribution of smokers and non-smokers in the BMI vs. Charges Scatter Plot:

In [None]:
sns.scatterplot(x="bmi", y="charges", data=data, palette='Set2', hue='smoker')

<a id="4"></a> <br>
## 4. DATA PREPARATION AND CLEANING

In this section, we will apply various preprocessing steps so that the data can be used correctly. First, let us explain which columns will be used as the data and which as the target. We will use the "age", "sex", "bmi", "children" and "smoker" columns as the data, i.e. X, and the "charges" column as the target, i.e. y. We will drop the "region" column from the dataset, because we have not analyzed it and keeping it would only slow down our process. Now, let's do the related operations one by one.

<a id="41"></a> <br>
### 4.1. Drop Irrelevant Columns

In this section, we will delete the columns that are of no use to us. The only column we will not use is "_region_", so we just need to delete it:

In [None]:
data.drop(["region"], axis=1, inplace=True)  # Drop "region" column from dataset

<a id="42"></a> <br>
### 4.2. Correction of Column(Attribute) Names

In this section, we would normally replace column names that contain spaces or unwanted characters with more readable ones. There is no problem with the column names in our dataset, but to show how the process is done, I will rename the columns and make them all upper case:

In [None]:
data.rename(columns={"age" : "AGE", "sex" : "GENDER", "bmi" : "BMI", "children" : "CHILDREN", "smoker": "SMOKER", "charges" : "CHARGES"}, inplace=True)
data.columns

<a id="43"></a> <br>
### 4.3. Split Data and Target

We can't give the whole dataset to the model as it is. First we need to separate its data part from its target part. The data part is called X and the target part is called y. Now let's split them and assign each one to its own variable:

In [None]:
X = data.drop(["CHARGES"], axis=1)  # Put all columns except "CHARGES" into the X variable
y = data.CHARGES.values  # Put only the "CHARGES" column into the y variable

<a id="44"></a> <br>
### 4.4. Handling Missing Values

There are no missing values in this dataset, so no action is needed here.

<a id="45"></a> <br>
### 4.5. Type Conversions and Encoding

Here we will convert categorical data into numeric data. In our dataset, the categorical columns are GENDER and SMOKER. We will encode them as follows:

* **GENDER :** "male" becomes 0 and "female" becomes 1.
* **SMOKER :** "no" becomes 0 and "yes" becomes 1.

In [None]:
X.GENDER = [1 if each == "female" else 0 for each in X.GENDER]
X.SMOKER = [1 if each == "yes" else 0 for each in X.SMOKER]
X.head()
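
The same encoding can also be sketched with `Series.map` from Pandas. This is an alternative to the list comprehensions above, shown on a made-up toy frame (`toy` is not part of the notebook's data):

```python
import pandas as pd

# Toy frame with the same column names and categories as X (values are made up)
toy = pd.DataFrame({"GENDER": ["male", "female", "female"],
                    "SMOKER": ["yes", "no", "no"]})

# Series.map replaces each category with its code
toy["GENDER"] = toy["GENDER"].map({"male": 0, "female": 1})
toy["SMOKER"] = toy["SMOKER"].map({"no": 0, "yes": 1})
print(toy)
```

One difference worth knowing: `map` turns a category that is missing from the dictionary into NaN, while the list comprehension silently encodes every unexpected value as 0.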

<a id="46"></a> <br>
### 4.6. Normalization of Data

The values of the features can be on very different scales. This can sometimes lead to undesirable behavior in regression algorithms. Therefore, we need to normalize the data.

In [None]:
X = (X - X.min()) / (X.max() - X.min())  # Min-max scale every column of X to the range [0, 1]
X.BMI

As you can see, the values were normalized between 0 and 1. Now let's prepare our train and test data and finish this part.
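
The min-max formula behind this step can be sketched on a small table. The frame `toy` and its values are made up for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy feature table (values are made up)
toy = pd.DataFrame({"AGE": [18, 41, 64], "BMI": [16.0, 30.0, 53.0]})

# Min-max scaling: (value - column minimum) / (column maximum - column minimum)
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```

After scaling, the smallest value of every column is 0 and the largest is 1, so columns with very different units become comparable.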

<a id="47"></a> <br>
### 4.7. Preparation of Test and Train Data

The final step is to split the data randomly into train and test sets. For this, we will use the function named "_train_test_split_" from the Scikit-Learn library. I would like to use 20% of our data for testing and 80% for training. The process is very simple:

In [None]:
from sklearn.model_selection import train_test_split  # Import "train_test_split" method

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Parameters:
# test_size : Fraction of the data reserved for testing (0.2 means 20%).
# random_state : Seed of the random shuffling. Any integer works; a fixed value makes the split reproducible.
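
To see what the two parameters do, here is a small sketch on placeholder arrays that only share the row count (1338) with the dataset. The names `X_demo`, `y_demo` and the `*_demo` results are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset
X_demo = np.arange(1338 * 2).reshape(1338, 2)
y_demo = np.arange(1338)

x_tr_demo, x_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
print(x_tr_demo.shape, x_te_demo.shape)  # row counts of the train and test parts

# Repeating the call with the same seed reproduces exactly the same split
_, x_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
print(np.array_equal(x_te_demo, x_te_again))
```

The test part receives about 20% of the rows and the rest goes to training. Changing `random_state` changes which rows land in each part, but not how many.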

<a id="5"></a> <br>
## 5. BUILDING MODELS

In this section, we will build regression models and fit them with data. The regression algorithms used in this section are:

1. Multiple Linear Regression
2. Polynomial Regression
3. Decision Tree Regression
4. Random Forest Regression
5. Support Vector Regression

<a id="51"></a> <br>
### 5.1. Multiple Linear Regression

Create the Multiple Linear Regression model and fit the data:

In [None]:
from sklearn.linear_model import LinearRegression  # Import Linear Regression model

multiple_linear_reg = LinearRegression(fit_intercept=False)  # Create an instance of the Linear Regression model
multiple_linear_reg.fit(x_train, y_train)  # Fit data to the model

<a id="52"></a> <br>
### 5.2. Polynomial Regression

In the Polynomial Regression model, we must first expand the data into polynomial features of the specified degree. For this, we will take advantage of the PolynomialFeatures class of the Scikit-Learn library. Now we will transform the training and testing X data into polynomial features and then fit the model:

In [None]:
from sklearn.preprocessing import PolynomialFeatures

polynomial_features = PolynomialFeatures(degree=3)  # Create a PolynomialFeatures instance in degree 3
x_train_poly = polynomial_features.fit_transform(x_train)  # Fit and transform the training data to polynomial
x_test_poly = polynomial_features.transform(x_test)  # Transform the testing data with the already fitted instance

polynomial_reg = LinearRegression(fit_intercept=False)  # Create an instance of the Linear Regression model
polynomial_reg.fit(x_train_poly, y_train)  # Fit data to the model
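
To see what PolynomialFeatures actually produces, here is a small sketch on a made-up sample. The names `sample`, `poly2` and `poly3` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two features, a=2 and b=3
sample = np.array([[2, 3]])

# Degree 2 produces the columns 1, a, b, a^2, a*b, b^2
poly2 = PolynomialFeatures(degree=2)
expanded = poly2.fit_transform(sample)
print(expanded)

# With 5 input features (as in our X) and degree 3 the expansion is much wider
poly3 = PolynomialFeatures(degree=3)
poly3.fit(np.zeros((1, 5)))
print(poly3.n_output_features_)
```

The first column of the expansion is the constant 1, which plays the role of the intercept. The number of columns grows quickly with the degree, which is one reason why high degrees tend to overfit.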

<a id="53"></a> <br>
### 5.3. Decision Tree Regression

Create a Decision Tree Regression model and fit the data (we use 13 for randomness seed value):

In [None]:
from sklearn.tree import DecisionTreeRegressor  # Import Decision Tree Regression model

decision_tree_reg = DecisionTreeRegressor(max_depth=5, random_state=13)  # Create an instance of the Decision Tree Regression model
decision_tree_reg.fit(x_train, y_train)  # Fit data to the model

<a id="54"></a> <br>
### 5.4. Random Forest Regression

Create a Random Forest Regression model using 400 Estimators to fit the data (we use 13 for randomness seed value):

In [None]:
from sklearn.ensemble import RandomForestRegressor  # Import Random Forest Regression model

random_forest_reg = RandomForestRegressor(n_estimators=400, max_depth=5, random_state=13)  # Create an instance of the Random Forest Regression model
random_forest_reg.fit(x_train, y_train)  # Fit data to the model

<a id="55"></a> <br>
### 5.5. Support Vector Regression

Create the Support Vector Regression model and fit the data:

In [None]:
from sklearn.svm import SVR  # Import SVR model

support_vector_reg = SVR(gamma="auto", kernel="linear", C=1000)  # Create an instance of the Support Vector Regression model
support_vector_reg.fit(x_train, y_train)  # Fit data to the model

<a id="6"></a> <br>
## 6. EVALUATING MODELS

In this section, we will measure the performance of the models we have fitted. The R-Squared score will be used as the accuracy measure of the models, and the Mean Squared Error (MSE) will be used to measure the error. As the MSE can be very large, we take its square root to obtain the RMSE, which is in the same unit as the target. Both measures are computed on the training set and on the test set. In addition, 10-Fold Cross Validation is used as a further check. Let's first import the necessary functions and then evaluate each model:

In [None]:
from sklearn.model_selection import cross_val_predict  # For K-Fold Cross Validation
from sklearn.metrics import r2_score  # For measuring accuracy with the R2 score
from sklearn.metrics import mean_squared_error  # For MSE
from math import sqrt  # For the square root operation
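
Before using these functions, it may help to see what the two measures compute. The following is a minimal sketch in plain Python; the toy targets `y_true` and predictions `y_hat` are made up for illustration.

```python
from math import sqrt

# Toy targets and predictions (made up)
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.5, 7.0, 9.5]

n = len(y_true)
mean_true = sum(y_true) / n

# Sum of squared residuals and total sum of squares
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

r2 = 1 - ss_res / ss_tot  # share of the target's variance explained by the predictions
rmse = sqrt(ss_res / n)   # typical size of the error, in the same unit as the target

print(r2, rmse)
```

An R-Squared score of 1.0 means perfect predictions, while 0.0 means the model is no better than always predicting the mean. The score can even become negative for a model that is worse than that.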

<a id="61"></a> <br>
### 6.1. Evaluating Multiple Linear Regression Model

In [None]:
# Prediction with training dataset:
y_pred_MLR_train = multiple_linear_reg.predict(x_train)

# Prediction with testing dataset:
y_pred_MLR_test = multiple_linear_reg.predict(x_test)

# Find training accuracy for this model:
accuracy_MLR_train = r2_score(y_train, y_pred_MLR_train)
print("Training Accuracy for Multiple Linear Regression Model: ", accuracy_MLR_train)

# Find testing accuracy for this model:
accuracy_MLR_test = r2_score(y_test, y_pred_MLR_test)
print("Testing Accuracy for Multiple Linear Regression Model: ", accuracy_MLR_test)

# Find RMSE for training data:
RMSE_MLR_train = sqrt(mean_squared_error(y_train, y_pred_MLR_train))
print("RMSE for Training Data: ", RMSE_MLR_train)

# Find RMSE for testing data:
RMSE_MLR_test = sqrt(mean_squared_error(y_test, y_pred_MLR_test))
print("RMSE for Testing Data: ", RMSE_MLR_test)

# Prediction with 10-Fold Cross Validation:
y_pred_cv_MLR = cross_val_predict(multiple_linear_reg, X, y, cv=10)

# Find accuracy after 10-Fold Cross Validation
accuracy_cv_MLR = r2_score(y, y_pred_cv_MLR)
print("Accuracy for 10-Fold Cross Predicted Multiple Linear Regression Model: ", accuracy_cv_MLR)

<a id="62"></a> <br>
### 6.2. Evaluating Polynomial Regression Model

In [None]:
# Prediction with training dataset:
y_pred_PR_train = polynomial_reg.predict(x_train_poly)

# Prediction with testing dataset:
y_pred_PR_test = polynomial_reg.predict(x_test_poly)

# Find training accuracy for this model:
accuracy_PR_train = r2_score(y_train, y_pred_PR_train)
print("Training Accuracy for Polynomial Regression Model: ", accuracy_PR_train)

# Find testing accuracy for this model:
accuracy_PR_test = r2_score(y_test, y_pred_PR_test)
print("Testing Accuracy for Polynomial Regression Model: ", accuracy_PR_test)

# Find RMSE for training data:
RMSE_PR_train = sqrt(mean_squared_error(y_train, y_pred_PR_train))
print("RMSE for Training Data: ", RMSE_PR_train)

# Find RMSE for testing data:
RMSE_PR_test = sqrt(mean_squared_error(y_test, y_pred_PR_test))
print("RMSE for Testing Data: ", RMSE_PR_test)

# Prediction with 10-Fold Cross Validation:
y_pred_cv_PR = cross_val_predict(polynomial_reg, polynomial_features.fit_transform(X), y, cv=10)

# Find accuracy after 10-Fold Cross Validation
accuracy_cv_PR = r2_score(y, y_pred_cv_PR)
print("Accuracy for 10-Fold Cross Predicted Polynomial Regression Model: ", accuracy_cv_PR)
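
In the cross validation call above, the polynomial expansion is applied to the whole of X before the folds are made. PolynomialFeatures does not learn anything from the data values, so this is harmless here, but for transformations that do learn from the data (such as scaling) it is safer to keep them inside each fold. A possible way to do this is a Scikit-Learn pipeline. The sketch below uses synthetic data; `X_demo`, `y_demo` and `poly_pipeline` are made-up names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic relationship (not the insurance dataset)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 1, size=(200, 2))
y_demo = 3 * X_demo[:, 0] ** 2 + 2 * X_demo[:, 1] + rng.normal(0, 0.01, size=200)

# The pipeline expands the features and fits the linear model inside every fold
poly_pipeline = make_pipeline(PolynomialFeatures(degree=2),
                              LinearRegression(fit_intercept=False))
y_demo_cv = cross_val_predict(poly_pipeline, X_demo, y_demo, cv=10)
print(r2_score(y_demo, y_demo_cv))
```

Each prediction in `y_demo_cv` comes from a model that did not see that row during training, which is what makes the cross validated score a fair estimate.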

<a id="63"></a> <br>
### 6.3. Evaluating Decision Tree Regression Model

In [None]:
# Prediction with training dataset:
y_pred_DTR_train = decision_tree_reg.predict(x_train)

# Prediction with testing dataset:
y_pred_DTR_test = decision_tree_reg.predict(x_test)

# Find training accuracy for this model:
accuracy_DTR_train = r2_score(y_train, y_pred_DTR_train)
print("Training Accuracy for Decision Tree Regression Model: ", accuracy_DTR_train)

# Find testing accuracy for this model:
accuracy_DTR_test = r2_score(y_test, y_pred_DTR_test)
print("Testing Accuracy for Decision Tree Regression Model: ", accuracy_DTR_test)

# Find RMSE for training data:
RMSE_DTR_train = sqrt(mean_squared_error(y_train, y_pred_DTR_train))
print("RMSE for Training Data: ", RMSE_DTR_train)

# Find RMSE for testing data:
RMSE_DTR_test = sqrt(mean_squared_error(y_test, y_pred_DTR_test))
print("RMSE for Testing Data: ", RMSE_DTR_test)

# Prediction with 10-Fold Cross Validation:
y_pred_cv_DTR = cross_val_predict(decision_tree_reg, X, y, cv=10)

# Find accuracy after 10-Fold Cross Validation
accuracy_cv_DTR = r2_score(y, y_pred_cv_DTR)
print("Accuracy for 10-Fold Cross Predicted Decision Tree Regression Model: ", accuracy_cv_DTR)

<a id="64"></a> <br>
### 6.4. Evaluating Random Forest Regression Model

In [None]:
# Prediction with training dataset:
y_pred_RFR_train = random_forest_reg.predict(x_train)

# Prediction with testing dataset:
y_pred_RFR_test = random_forest_reg.predict(x_test)

# Find training accuracy for this model:
accuracy_RFR_train = r2_score(y_train, y_pred_RFR_train)
print("Training Accuracy for Random Forest Regression Model: ", accuracy_RFR_train)

# Find testing accuracy for this model:
accuracy_RFR_test = r2_score(y_test, y_pred_RFR_test)
print("Testing Accuracy for Random Forest Regression Model: ", accuracy_RFR_test)

# Find RMSE for training data:
RMSE_RFR_train = sqrt(mean_squared_error(y_train, y_pred_RFR_train))
print("RMSE for Training Data: ", RMSE_RFR_train)

# Find RMSE for testing data:
RMSE_RFR_test = sqrt(mean_squared_error(y_test, y_pred_RFR_test))
print("RMSE for Testing Data: ", RMSE_RFR_test)

# Prediction with 10-Fold Cross Validation:
y_pred_cv_RFR = cross_val_predict(random_forest_reg, X, y, cv=10)

# Find accuracy after 10-Fold Cross Validation
accuracy_cv_RFR = r2_score(y, y_pred_cv_RFR)
print("Accuracy for 10-Fold Cross Predicted Random Forest Regression Model: ", accuracy_cv_RFR)

<a id="65"></a> <br>
### 6.5. Evaluating Support Vector Regression Model

In [None]:
# Prediction with training dataset:
y_pred_SVR_train = support_vector_reg.predict(x_train)

# Prediction with testing dataset:
y_pred_SVR_test = support_vector_reg.predict(x_test)

# Find training accuracy for this model:
accuracy_SVR_train = r2_score(y_train, y_pred_SVR_train)
print("Training Accuracy for Support Vector Regression Model: ", accuracy_SVR_train)

# Find testing accuracy for this model:
accuracy_SVR_test = r2_score(y_test, y_pred_SVR_test)
print("Testing Accuracy for Support Vector Regression Model: ", accuracy_SVR_test)

# Find RMSE for training data:
RMSE_SVR_train = sqrt(mean_squared_error(y_train, y_pred_SVR_train))
print("RMSE for Training Data: ", RMSE_SVR_train)

# Find RMSE for testing data:
RMSE_SVR_test = sqrt(mean_squared_error(y_test, y_pred_SVR_test))
print("RMSE for Testing Data: ", RMSE_SVR_test)

# Prediction with 10-Fold Cross Validation:
y_pred_cv_SVR = cross_val_predict(support_vector_reg, X, y, cv=10)

# Find accuracy after 10-Fold Cross Validation
accuracy_cv_SVR = r2_score(y, y_pred_cv_SVR)
print("Accuracy for 10-Fold Cross Predicted Support Vector Regression Model: ", accuracy_cv_SVR)
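
The five evaluation cells above repeat the same steps for every model. As a possible simplification, the common part could be collected in a helper function. The sketch below is one way to do it; the function name `evaluate_model` and the synthetic data are made up, and it is not the code that produced the results in this kernel.

```python
from math import sqrt

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def evaluate_model(model, x_tr, y_tr, x_te, y_te):
    """Return the R2 score and RMSE of a fitted model on train and test data."""
    results = {}
    for name, x_part, y_part in (("train", x_tr, y_tr), ("test", x_te, y_te)):
        y_hat = model.predict(x_part)
        results["r2_" + name] = r2_score(y_part, y_hat)
        results["rmse_" + name] = sqrt(mean_squared_error(y_part, y_hat))
    return results


# Synthetic, noise-free linear data just to exercise the helper
rng = np.random.RandomState(1)
X_demo = rng.uniform(0, 1, size=(100, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])

demo_model = LinearRegression().fit(X_demo[:80], y_demo[:80])
scores = evaluate_model(demo_model, X_demo[:80], y_demo[:80], X_demo[80:], y_demo[80:])
print(scores)
```

With such a helper, each model would need only one call, and the returned dictionaries could be collected directly into the results table.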

<a id="7"></a> <br>
## 7. EXPLORATION OF RESULTS

Let's collect all the results of the models in one table:

In [None]:
training_accuracies = [accuracy_MLR_train, accuracy_PR_train, accuracy_DTR_train, accuracy_RFR_train, accuracy_SVR_train]
testing_accuracies = [accuracy_MLR_test, accuracy_PR_test, accuracy_DTR_test, accuracy_RFR_test, accuracy_SVR_test]
training_RMSE = [RMSE_MLR_train, RMSE_PR_train, RMSE_DTR_train, RMSE_RFR_train, RMSE_SVR_train]
testing_RMSE = [RMSE_MLR_test, RMSE_PR_test, RMSE_DTR_test, RMSE_RFR_test, RMSE_SVR_test]
cv_accuracies = [accuracy_cv_MLR, accuracy_cv_PR, accuracy_cv_DTR, accuracy_cv_RFR, accuracy_cv_SVR]
parameters = ["fit_intercept=False", "fit_intercept=False", "max_depth=5", "n_estimators=400, max_depth=5", "kernel='linear', C=1000"]

table_data = {"Parameters": parameters, "Training Accuracy": training_accuracies, "Testing Accuracy": testing_accuracies, 
              "Training RMSE": training_RMSE, "Testing RMSE": testing_RMSE, "10-Fold Score": cv_accuracies}
model_names = ["Multiple Linear Regression", "Polynomial Regression", "Decision Tree Regression", "Random Forest Regression", "Support Vector Regression"]

table_dataframe = pd.DataFrame(data=table_data, index=model_names)
table_dataframe

Now let's compare the training and testing accuracy of each model:

In [None]:
table_dataframe.iloc[:, 1:3].plot(kind="bar", ylim=[0.0, 1.0])

Let's compare each model's training and testing RMSE:

In [None]:
table_dataframe.iloc[:, 3:5].plot(kind="bar", ylim=[0, 8000])

Finally, compare the score values for 10-Fold Cross Validation:

In [None]:
table_dataframe.iloc[:, 5].plot(kind="bar", ylim=[0.0, 1.0])

As you can see, all models gave very close results. The training accuracy and the testing accuracy of a model should be close to each other, because a large gap between them indicates overfitting. There may be many reasons why the results are not higher. First of all, we didn't do proper feature engineering here and we didn't use methods such as PCA (Principal Component Analysis). Moreover, the small size of the dataset can also be a factor.

<a id="8"></a> <br>
## 8. CONCLUSION

This kernel is designed to illustrate the regression and EDA stages in machine learning from start to finish. You can use the code and information in the example as you wish, but please respect the dataset's license when you use the dataset. In this kernel, we tried to cover:

* Data Science processes
* How Exploratory Data Analysis (EDA) is performed on a dataset
* How to find and visualize the correlation between features in data
* How to make various visualizations of a dataset
* Use of libraries such as Pandas, Matplotlib, Seaborn and Scikit-Learn in Python
* How to do simple Data Cleaning
* How to split a dataset into Test and Train sets
* How to build and train Machine Learning models
* How polynomial features are created in Polynomial Regression
* How to evaluate the trained models
* How the R-Squared Score is calculated
* How RMSE values are calculated
* How K-Fold Cross Validation is done and how the score is calculated
* How the models evaluated are visually compared and interpreted

I didn't give a detailed explanation of each subject. You can find detailed descriptions of any of these topics on the internet, or you can ask me about them in the comments.

_Best regards..._<br>
__Mustafa YEMURAL__