# Load and Check Data 🧔 <a href="1"></a>
<img src="https://i1.wp.com/edulastic.com/wp-content/uploads/sites/2/2017/04/ebookdribbble7.gif?fit=470%2C353&ssl=1" width="100%"/>

In [None]:
# import the necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

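# "seaborn-whitegrid" was renamed to "seaborn-v0_8-whitegrid" in matplotlib 3.6
style = "seaborn-v0_8-whitegrid" if "seaborn-v0_8-whitegrid" in plt.style.available else "seaborn-whitegrid"
plt.style.use(style)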
warnings.filterwarnings("ignore")

In [None]:
# load data
data = "../input/insurance/insurance.csv"
df = pd.read_csv(data)

# show the first 6 rows
df.head(6)

## Variable Description

**Columns:** <br>
- **age:** age of the primary beneficiary
- **sex:** gender of the insurance contractor (female, male)
- **bmi:** body mass index, an objective index of body weight relative to height, computed as weight divided by height squared (kg / m^2); ideally 18.5 to 24.9
- **children:** number of children covered by health insurance (number of dependents)
- **smoker:** whether the beneficiary smokes (yes, no)
- **region:** the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)
- **charges:** individual medical costs billed by health insurance

In [None]:
df.describe().T

- The average age of the people in the data is 39 and the standard deviation is 14, so most ages fall roughly between 25 and 53 (one standard deviation around the mean). Comparing the Min and Max values with the quartiles shows that there are outliers in the data.
- According to the BMI values, the number of overweight and obese people in this data set is high.
- The majority of people in the data set have at least 1 child.
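
The three observations above can be checked numerically rather than read off the summary table. The sketch below runs on a small made-up frame (not the insurance data), and the BMI cut-offs of 18.5, 25 and 30 are an assumption based on the commonly used WHO categories; the same expressions should work on `df`.

```python
import pandas as pd

# Toy data for illustration only; these values are made up, not taken from insurance.csv
toy = pd.DataFrame({
    "age": [19, 25, 33, 41, 52, 64],
    "bmi": [17.5, 22.0, 27.3, 31.1, 36.4, 24.9],
    "children": [0, 1, 0, 2, 3, 1],
})

# Band of one standard deviation around the mean age
age_mean, age_std = toy["age"].mean(), toy["age"].std()
age_band = (age_mean - age_std, age_mean + age_std)

# BMI categories; the cut-offs are an assumption (common WHO convention)
bmi_group = pd.cut(
    toy["bmi"],
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=["underweight", "normal", "overweight", "obese"],
    right=False,  # intervals are [lower, upper)
)
bmi_share = bmi_group.value_counts(normalize=True)

# Share of people with at least one child
share_with_children = (toy["children"] >= 1).mean()

print(age_band)
print(bmi_share)
print(share_with_children)
```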

In [None]:
df.info()

## Missing Value

Missing data can lead to wrong results when training a machine learning model. Therefore, we need to detect and handle any missing values in the data set.

In [None]:
df.columns[df.isnull().any()]

In [None]:
df.isnull().sum()

The output above shows that our data set is complete: there are no missing values in any column.
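
Since this data set has nothing to fix, here is a minimal sketch of what "detect and adjust" could look like when values are missing. The toy frame is made up, and filling with the median or the most frequent value is only one possible strategy, chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing entries (made-up values)
toy = pd.DataFrame({
    "age": [23, np.nan, 45, 31],
    "bmi": [27.1, 30.4, np.nan, 22.8],
    "smoker": ["no", "yes", None, "no"],
})

# Detect: number of missing values per column
missing_counts = toy.isnull().sum()

# Adjust: median for numeric columns, most frequent value for the others
filled = toy.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(missing_counts)
print(filled)
```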

# Data Preparation 
## Inconsistent Observation
It is often said that 95% of a machine learning project is preprocessing and 5% is model selection. A model can only learn correctly from data that has been prepared correctly, so we must apply certain preprocessing methods before modeling. One of these methods is outlier analysis. An outlier is any data point that is substantially different from the rest of the observations in a data set. In other words, it is an observation that lies far outside the general trend.
<img src="https://miro.medium.com/max/854/1*RW-vfIbKZh-UGsLfTAWpyw.png" />

Outliers behave differently from the rest of the data and can increase the model's error, so they must be detected and handled before modeling.


We can spot inconsistent observations with many visualization techniques. One of them is the box plot. If there is an outlier, it is drawn as an individual point, while the rest of the observations are grouped together and displayed as a box.
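
The points a box plot draws individually can also be found numerically. The sketch below uses the 1.5 × IQR rule, which is the default whisker setting in matplotlib and seaborn box plots; the helper name `iqr_outliers` and the sample values are hypothetical.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return a boolean mask marking points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Made-up BMI-like values; 53.0 sits far from the rest
sample = [22.0, 24.5, 26.1, 27.8, 29.3, 30.2, 31.0, 53.0]
mask = iqr_outliers(sample)

print(np.asarray(sample)[mask])  # only the flagged values
```

On the notebook's data, something like `iqr_outliers(df["bmi"])` would flag the same points that appear as dots in the box plot.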

In [None]:
data = df.copy()
data = data.select_dtypes(include=["float64","int64"])
data.head()

In [None]:
column_list = ['age', 'bmi', 'children', 'charges']
for col in column_list:
    sns.boxplot(x = data[col])
    plt.xlabel(col)
    plt.show()

The box plots above show that there are outliers in the bmi and charges values. However, these outliers do not harm our data set; on the contrary, they make the data easier to interpret. Therefore, we leave them unchanged.

# Linear Regression

<img src="https://cdn.lynda.com/course/645049/645049-637286229812196095-16x9.jpg" />

In statistics, linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called simple linear regression.

$$y = b_0 + b_1 x$$

- $b_0$: constant (intercept)
- $b_1$: coefficient (slope)
- $x$: value of the independent variable
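
For simple linear regression, the constant and the coefficient have a closed-form least-squares solution. The sketch below computes them with NumPy on made-up points that lie exactly on a line; the helper name `fit_simple_linear` is hypothetical and is not part of scikit-learn.

```python
import numpy as np

def fit_simple_linear(x, y):
    """Ordinary least squares for y = b0 + b1 * x with a single feature."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # slope: covariance of x and y divided by the variance of x
    b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # intercept: forces the line through the point of means
    b0 = y_mean - b1 * x_mean
    return b0, b1

# Made-up points generated from y = 3 + 2x
x = [1, 2, 3, 4, 5]
y = [5, 7, 9, 11, 13]
b0, b1 = fit_simple_linear(x, y)

print("b0:", b0, "b1:", b1)
```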

### Advantages
- Linear regression is simple to implement, and its output coefficients are easy to interpret.
- When you know that the independent and dependent variables have a linear relationship, this algorithm is a strong choice because it is less complex than other algorithms.
- Linear regression is susceptible to overfitting, but this can be mitigated with dimensionality reduction, regularization (L1 and L2), and cross-validation.

### Disadvantages
- Outliers can have a huge effect on the regression, and the boundaries it fits are always linear.
- Linear regression assumes a straight-line relationship between the dependent and independent variables. It also assumes independence between attributes.
- Linear regression models the relationship between the mean of the dependent variable and the independent variables. Just as the mean is not a complete description of a single variable, linear regression is not a complete description of relationships among variables.


### Linear Regression Usage Areas
Linear Regression is a rather ubiquitous curve fitting and machine learning technique that's used everywhere from scientific research teams to stock markets. Some uses:
- Studying engine performance from test data in automobiles
- Least squares regression is used to model causal relationships between parameters in biological systems
- OLS regression can be used in weather data analysis
- Linear regression can be used in market research studies and customer survey results analysis
- Linear regression is used in observational astronomy commonly enough. A number of statistical tools and methods are used in astronomical data analysis, and there are entire libraries in languages like Python meant to do data analysis in astrophysics.

In [None]:
f= plt.figure(figsize=(16,5))

ax=f.add_subplot(121)
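# histplot with kde=True replaces the deprecated distplot
sns.histplot(df['charges'],bins=50,color='r',kde=True,stat='density',ax=ax)
ax.set_title('Distribution of insurance charges')

ax=f.add_subplot(122)
sns.histplot(np.log10(df['charges']),bins=40,color='b',kde=True,stat='density',ax=ax)
ax.set_title('Distribution of insurance charges (log10 scale)')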

plt.show()

In [None]:
f = plt.figure(figsize=(14,6))
ax = f.add_subplot(121)
sns.violinplot(x='sex', y='charges',data=df,palette='Wistia',ax=ax)
ax.set_title('Violin plot of Charges vs sex')

ax = f.add_subplot(122)
sns.violinplot(x='smoker', y='charges',data=df,palette='magma',ax=ax)

plt.show()

In [None]:
sns.jointplot(x="bmi",y="charges",data=df,kind="reg")
plt.show()

In [None]:
sns.jointplot(x="age",y="charges",data=df,kind="reg")
plt.show()

In [None]:
sns.jointplot(x="children",y="charges",data=df,kind="reg")
plt.show()

In [None]:
# import the necessary packages
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from scipy.stats import boxcox
from sklearn import metrics

df_encode = df.copy()

The dummy variable trap is a scenario in which the independent variables are multicollinear, that is, two or more variables are highly correlated. In simple terms, one variable can be predicted from the others.

With the pandas get_dummies function we can do the encoding in one line of code. We use this function to create dummy variables for the sex, smoker, and region features. Setting drop_first=True avoids the dummy variable trap by dropping one dummy column per feature; get_dummies also removes the original column.
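
The effect of `drop_first=True` is easiest to see on a tiny example. The frame below is made up for illustration; only `pd.get_dummies` and its `columns` and `drop_first` parameters come from pandas.

```python
import pandas as pd

# Made-up rows with two categorical features
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no"],
    "region": ["northeast", "southwest", "northwest"],
})

# Full encoding: one column per category
full = pd.get_dummies(toy, columns=["smoker", "region"])

# Reduced encoding: the first category of each feature is dropped
reduced = pd.get_dummies(toy, columns=["smoker", "region"], drop_first=True)

# In the full encoding smoker_no + smoker_yes is always 1,
# so one of the two columns carries no extra information
redundant = (full["smoker_no"].astype(int) + full["smoker_yes"].astype(int) == 1).all()

print(list(full.columns))
print(list(reduced.columns))
print(redundant)
```

The dropped category becomes the baseline: a row with zeros in all remaining region columns belongs to the dropped region.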

In [None]:
df_encode = pd.get_dummies(data = df_encode, columns = ['sex','smoker','region'], drop_first = True)
df_encode.head()

In [None]:
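# normalization: bring the charges distribution closer to a normal distribution
# boxcox estimates the transformation parameter lambda and its 95% confidence interval;
# its results (y_bc, lam, ci) are not used further, the log transform below is what is applied
y_bc,lam, ci= boxcox(df_encode['charges'],alpha=0.05)
df_encode['charges'] = np.log(df_encode['charges'])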

df_encode.head()

<img src="https://cdn.lynda.com/course/721905/721905-637286247907951322-16x9.jpg" />

**Why is normalization done?** <br>
In this notebook, normalization means transforming a variable so that its distribution is closer to a normal distribution. It should not be confused with database normalization, which organizes tables into normal forms to eliminate data duplication and increase data consistency.

The charges column is strongly right-skewed, as the distribution plots of charges show: most people have low charges and a few have very high ones. Taking the logarithm compresses this long right tail.

When applied successfully, the transformation makes the target easier for a linear model to fit, because extreme values no longer dominate the squared error.
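
The log transform applied to charges in the code cell above can be illustrated on synthetic data. The sketch below draws made-up, right-skewed values from a log-normal distribution (the parameters are arbitrary, not fitted to the insurance data) and compares the skewness before and after taking the logarithm; the `skewness` helper is hypothetical.

```python
import numpy as np

def skewness(values):
    """Sample skewness: third central moment divided by variance ** 1.5."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

# Synthetic right-skewed "charges"; parameters are arbitrary assumptions
rng = np.random.default_rng(0)
charges_like = rng.lognormal(mean=9.0, sigma=0.9, size=5000)

skew_raw = skewness(charges_like)          # clearly positive: long right tail
skew_log = skewness(np.log(charges_like))  # close to zero: roughly symmetric

print("skewness before log:", skew_raw)
print("skewness after log: ", skew_log)
```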

In [None]:
X = df_encode.drop('charges',axis=1) 
y = df_encode['charges']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=23)

In [None]:
X = df_encode['bmi'].values.reshape(-1,1)  # independent variable
y = df_encode['charges'] # dependent variable (log-transformed)

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=42)
lin_reg = LinearRegression()
model = lin_reg.fit(X_train,y_train)
predictions = lin_reg.predict(X_test)

print("intercept: ", model.intercept_)
print("coef: ", model.coef_)
print("R^2 score: ", model.score(X_test,y_test))

In [None]:
plt.figure(figsize=(12,6))
plt.scatter(y_test,predictions)
plt.xlabel("actual charges (log)")
plt.ylabel("predicted charges (log)")
plt.show()

In [None]:
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
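
To make the three metrics above concrete, the sketch below computes them by hand with NumPy on made-up numbers, together with the R^2 value that `model.score` reports. Because charges was log-transformed earlier, the errors printed by the notebook are in log units, not in the original currency.

```python
import numpy as np

# Made-up true values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, same units as y

# R^2: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R^2:", r2)
```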

In [None]:
plt.figure(figsize=(12,6))
g = sns.regplot(x=df_encode['bmi'],y=df_encode["charges"],ci=None,scatter_kws = {'color':'r','s':9})
g.set_title("Model Equation")
g.set_ylabel("charges")
g.set_xlabel('bmi')
plt.show()

In [None]:
plt.figure(figsize=(12,6))
g = sns.regplot(x=df_encode['age'],y=df_encode["charges"],ci=None,scatter_kws = {'color':'r','s':9})
g.set_title("Model Equation")
g.set_ylabel("charges")
g.set_xlabel('age')
plt.show()