## 1. Imports and datasets

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as ms
from sklearn.linear_model import LinearRegression

In [None]:
df = pd.read_csv('../input/insurance/insurance.csv')

In [None]:
df.head()

## 2. Data Wrangling

First of all, we will look at the data and check whether there are missing values.

In [None]:
df.isnull().sum()

Fortunately, there are no NaN values, so we are going to transform the categorical columns sex and smoker into numerical values.

**In the case of sex:**

 0 = Female

 1 = Male

**In the case of smoker:**

 0 = Non-smoker

 1 = Smoker

In [None]:
# Assign the result back: replace(..., inplace=True) on a selected column may not update df in recent pandas versions
df['sex'] = df['sex'].map({'female': 0, 'male': 1})

In [None]:
# Same pattern as for sex: map the labels and assign the encoded column back
df['smoker'] = df['smoker'].map({'no': 0, 'yes': 1})

In [None]:
df.head()

Now, the data is in the correct format.

In [None]:
df.dtypes

The data types make sense too, so the data wrangling stage is finished.

## 3. Exploratory Data Analysis

Now, using the libraries that we imported, we will visualize the data and extract some information before building the formal prediction model.

In [None]:
df.describe()

Describing the dataset gives us some useful information.

    1. Only 20.4% of the people smoke
    2. The number of men and women is practically the same
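
Both figures can be read from the `mean` row of `df.describe()`, because the mean of a 0/1 column is the share of ones. Here is a minimal sketch of that idea; the `toy` frame and its values are made up for illustration and are not the insurance data.

```python
import pandas as pd

# Toy stand-in for the encoded insurance data (hypothetical values).
toy = pd.DataFrame({
    "sex":    [0, 1, 1, 0, 1, 0, 0, 1, 1, 0],   # 0 = female, 1 = male
    "smoker": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],   # 0 = non-smoker, 1 = smoker
})

# For a 0/1 column, the mean equals the fraction of rows that are 1.
smoker_share = toy["smoker"].mean()
male_share = toy["sex"].mean()

# value_counts(normalize=True) reports the same shares per category.
smoker_counts = toy["smoker"].value_counts(normalize=True)

print(f"smokers: {smoker_share:.1%}, male: {male_share:.1%}")
print(smoker_counts)
```

On the real data, the same two calls on `df["smoker"]` and `df["sex"]` should give the percentages quoted in the list.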
After this first look, we're going to compute the correlation between the features and the charges to see which ones are the most relevant.

In [None]:
# region is still a text column, so restrict the correlation to numeric columns
df.corr(numeric_only=True)['charges'].sort_values()

Wow, as we can see, the feature most strongly correlated with the charges is whether the person smokes, followed (logically) by age. So, there are two more conclusions:

    1. If you smoke, you will have to pay more.
    2. The older you get, the more you have to pay.

In [None]:
# numeric_only=True skips the text column region, which cannot be averaged
df.groupby("smoker").mean(numeric_only=True)

Wow, stunning: on average, smokers pay roughly 3.5 times as much for health insurance as non-smokers. On the other hand, the other features show no relevant differences between smokers and non-smokers.


Let's check the age and sex distribution of the people who smoke.

In [None]:
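f = plt.figure(figsize=(12, 5))
ax = f.add_subplot(121)
# histplot replaces the deprecated distplot; stat="density" keeps a density y-axis
sns.histplot(df[df.smoker == 1]['age'], kde=True, stat="density", color='b', ax=ax)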

Well, the plot suggests that it's more common for young people to smoke.

In [None]:
sns.catplot(x="smoker", kind="count",hue = 'sex', palette="Paired", data=df)

Read the bars with the encoding from section 2 (0 = female, 1 = male): in this dataset, the smoker group contains more men than women.

In [None]:
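# Draw both groups on the same axes, with one title and a legend to tell them apart
ax = sns.histplot(df[df.smoker == 0]["charges"], kde=True, stat="density", color='g', label='Non-smoker')
sns.histplot(df[df.smoker == 1]["charges"], kde=True, stat="density", color='b', label='Smoker', ax=ax)
ax.set(title='Insurance cost comparison between smokers and non-smokers'); ax.legend()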

As we saw before, the insurance cost for smokers is much higher.

So far we have looked at the most relevant features one by one; now we're going to put them all together.

In [None]:
df_gptest = df[['sex','smoker','charges','age']]
grouped_test1 = df_gptest.groupby(['sex','smoker'],as_index=False).mean()
grouped_test1

In [None]:
# lmplot creates its own figure, so set the title on the axes it returns
g = sns.lmplot(x="age", y="charges", hue="smoker", data=df, palette='Paired', height=7)
g.ax.set_title('Smokers and non-smokers')

Now we clearly see the importance of age: unfortunately, the older you get, the higher the charges.

### After analyzing the data we have come to some simple conclusions. Now it's time to develop a proper prediction algorithm.

## 4. Model Development

In this particular dataset, we want to predict a continuous value, so we are facing a regression problem, and we will start with **Linear Regression**.

<p>When it comes to simple linear regression, an excellent way to visualize the fit of our model is by using <b>regression plots</b>.</p>

<p>This plot shows a combination of scattered data points (a <b>scatter plot</b>) and the fitted <b>linear regression</b> line going through the data. This gives us a reasonable estimate of the relationship between the two variables, the strength of the correlation, and its direction (positive or negative correlation).</p>
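
To make the link between the fitted line and the correlation concrete, here is a minimal sketch of a one-variable least-squares fit. The `age` and `charges` arrays are synthetic (generated with assumed coefficients), not the insurance data, and `np.polyfit` stands in for the line that a regression plot draws.

```python
import numpy as np

# Synthetic data: charges grow linearly with age, plus noise (assumed coefficients).
rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=200).astype(float)
charges = 250.0 * age + 3000.0 + rng.normal(0.0, 2000.0, size=200)

# Least-squares line: charges ~ slope * age + intercept.
slope, intercept = np.polyfit(age, charges, deg=1)

# Pearson correlation: its sign gives the direction, its size the strength.
r = np.corrcoef(age, charges)[0, 1]

# For a single predictor, the slope is the correlation rescaled by the two spreads.
slope_from_r = r * charges.std() / age.std()

print(f"slope={slope:.1f}, intercept={intercept:.1f}, r={r:.2f}")
```

A positive slope and a positive `r` always go together in this one-variable case, which is why the direction of the line in the plot tells us the sign of the correlation.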

In [None]:
width = 12
height = 10
plt.figure(figsize=(width, height))
sns.regplot(x="age", y="charges", data=df)
plt.ylim(0,)

As we see, there's a positive relationship between age and charges.

<p>From the previous section we know that good predictors of charges could be:</p>
<ul>
    <li>Smoker</li>
    <li>Age</li>
    <li>Sex</li>
    <li>BMI</li>
</ul>
Let's develop a model using these variables as the predictor variables.



First, we're going to split the dataset, create a Linear Regression object from scikit-learn, and fit it.

### 4.1 First try using normal linear regression (No polynomial)

In [None]:
from sklearn.model_selection import train_test_split
y_data = df['charges']
x_data = df.drop('charges', axis=1)  # All the columns except the one we want to predict

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.10, random_state=1)


print("number of test samples :", x_test.shape[0])
print("number of training samples:",x_train.shape[0])



In [None]:
ln = LinearRegression()

In [None]:
ln.fit(x_train[['age', 'sex', 'bmi','children','smoker']], y_train)

In [None]:
yhat_train = ln.predict(x_train[['age', 'sex', 'bmi','children','smoker']])

In [None]:
yhat_test = ln.predict(x_test[['age', 'sex', 'bmi','children','smoker']])

Let's perform some model evaluation using our training and testing data separately. We already imported seaborn and matplotlib for plotting; the distribution plot function below is taken from the internet.

In [None]:
def DistributionPlot(RedFunction, BlueFunction, RedName, BlueName, Title):
    """Overlay the estimated densities of two series (e.g. actual vs. predicted charges)."""
    width = 12
    height = 10
    plt.figure(figsize=(width, height))

    # kdeplot draws the density curve that the deprecated distplot(hist=False) used to draw
    ax1 = sns.kdeplot(RedFunction, color="r", label=RedName)
    ax2 = sns.kdeplot(BlueFunction, color="b", label=BlueName, ax=ax1)

    plt.title(Title)
    plt.xlabel('Charge (in dollars)')
    plt.ylabel('Density')
    plt.legend()  # show which curve is which

    plt.show()
    plt.close()

In [None]:
Title = 'Distribution  Plot of  Predicted Value Using Training Data vs Training Data Distribution'
DistributionPlot(y_train, yhat_train, "Actual Values (Train)", "Predicted Values (Train)", Title)


As we see, the model works well for small charges, but has some problems fitting the larger amounts.

In [None]:
ln.score(x_train[['age', 'sex', 'bmi','children','smoker']], y_train)

In [None]:
Title='Distribution  Plot of  Predicted Value Using Test Data vs Data Distribution of Test Data'
DistributionPlot(y_test,yhat_test,"Actual Values (Test)","Predicted Values (Test)",Title)

In [None]:
ln.score(x_test[['age', 'sex', 'bmi','children','smoker']], y_test) #Not the best score...

On the test set we see the same problem: small charges are predicted well, but large ones are not.
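
The number returned by `ln.score` is the coefficient of determination R^2. Here is a minimal sketch of what it measures, using made-up `y_true` and `y_pred` values (not the notebook's predictions) where the largest charge is badly underestimated.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted charges; the large value is underestimated.
y_true = np.array([1000.0, 2000.0, 3000.0, 40000.0])
y_pred = np.array([1500.0, 2500.0, 3500.0, 25000.0])

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# RMSE is in the same unit as the target (dollars), which makes it easier to read.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(f"R^2 (manual)  = {r2_manual:.3f}")
print(f"R^2 (sklearn) = {r2_score(y_true, y_pred):.3f}")
print(f"RMSE          = {rmse:.0f}")
```

Because the errors are squared, one poorly predicted large charge dominates both numbers, which matches what the distribution plots show.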

### 4.2 Time to try using polynomial regression

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
Rsqu_test = []

order = [1, 2, 3, 4]
for n in order:
    pr = PolynomialFeatures(degree=n)
    
    x_train_pr = pr.fit_transform(x_train[['age', 'sex', 'bmi','children','smoker']])
    
    x_test_pr = pr.transform(x_test[['age', 'sex', 'bmi','children','smoker']])  # reuse the transformer fitted on the training data
    
    ln.fit(x_train_pr, y_train)
    
    Rsqu_test.append(ln.score(x_test_pr, y_test))

plt.plot(order, Rsqu_test)
plt.xlabel('order')
plt.ylabel('R^2')
plt.title('R^2 Using Test Data')
plt.text(3, 0.75, 'Maximum R^2 ')    

It's a little complicated to explain, but I'll try my best. The degree of the polynomial controls how flexible the model is: a higher degree can give a higher R^2 score (the share of the variance in the charges that the model explains, not an accuracy), but if the degree is too high, the model will overfit. So I try several degrees and keep the one with the best R^2 score on the test set.
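
Choosing the degree from the test-set score means the test data takes part in tuning, so the final test score can come out a little optimistic. One common alternative is to compare degrees with cross-validation on the training data only. The following is a minimal sketch of that idea on synthetic data; `X`, `y` and the coefficients are assumptions for illustration, not the insurance features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with an interaction term, so a degree-1 model underfits.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 0] * X[:, 1] + rng.normal(0.0, 0.5, size=200)

cv_r2 = {}
for degree in [1, 2, 3, 4]:
    # The pipeline refits PolynomialFeatures inside every fold,
    # so the held-out fold is only ever transformed, never fitted on.
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    cv_r2[degree] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

# Keep the degree with the best mean cross-validated R^2.
best_degree = max(cv_r2, key=cv_r2.get)
print(cv_r2)
print("best degree:", best_degree)
```

With this approach, the test set is used only once at the end, to report the score of the chosen model.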

In [None]:
pr = PolynomialFeatures(degree=3)
x_train_pr = pr.fit_transform(x_train[['age', 'sex', 'bmi','children','smoker']])
x_test_pr = pr.transform(x_test[['age', 'sex', 'bmi','children','smoker']])  # only transform the test data, never fit on it


In [None]:
poly = LinearRegression()
poly.fit(x_train_pr, y_train)
yhat_train = poly.predict(x_train_pr)
yhat_test = poly.predict(x_test_pr)

In [None]:
Title = 'Distribution  Plot of  Predicted Value Using Training Data vs Training Data Distribution'
DistributionPlot(y_train, yhat_train, "Actual Values (Train)", "Predicted Values (Train)", Title)

In [None]:
poly.score(x_train_pr, y_train)

On the training set, the prediction does not look great... but on the test set...

In [None]:
Title='Distribution  Plot of  Predicted Value Using Test Data vs Data Distribution of Test Data'
DistributionPlot(y_test,yhat_test,"Actual Values (Test)","Predicted Values (Test)",Title)

We have improved our model a lot!

In [None]:
poly.score(x_test_pr, y_test)

## So, that's all, my friends! I hope you like my FIRST KAGGLE submission. In some time I will add more explanation about the code.