## CONTENT:
* i. Overview
* ii) Explanation of Features
* iii) Used libraries
* iv) Loading and Copying the Data
* v) Exploratory Data Analysis
* vi) Print a table containing statistical data of the dataset
* vii) Check for NaN values and get to know the columns of the data
### VISUALIZATION STEPS:
* STEP-1: Correlation of Columns (Attributes) and Heatmap
* STEP-2: The distribution of smokers and non-smokers in the BMI and Charges Scatter Plot
* STEP-3: Examining the Relationship of Charges to the Categorical Features
* STEP-4: Number of people paying x amount for each charges category
* STEP-5: Relation between age and charges
* STEP-6: Relation between number of children and charges
* STEP-7: Relation between age and bmi
* STEP-8: Relation between sex and bmi
* STEP-9: Relation between children and bmi
* Conclusions

# i. Overview
This data analysis aims to explore the factors that affect medical costs.

# ii) Explanation of Features

* age: age of primary beneficiary

* sex: gender of the insurance contractor (female, male)

* bmi: body mass index (kg / m^2), an objective index of whether body weight is high or low relative to height, computed as weight divided by height squared; ideally 18.5 to 24.9

* children: number of children covered by health insurance / number of dependents

* smoker: whether the beneficiary smokes

* region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)

* charges: individual medical costs billed by health insurance
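
As a small illustration of the bmi feature described above, here is a minimal sketch of the formula (weight in kilograms divided by height in metres squared). The function names `bmi` and `in_ideal_range` and the example person are hypothetical and are not part of the dataset or the notebook.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


def in_ideal_range(value: float, low: float = 18.5, high: float = 24.9) -> bool:
    """True if a BMI value lies inside the 18.5 to 24.9 range quoted in the feature list."""
    return low <= value <= high


# Hypothetical person, not taken from the data: 70 kg and 1.75 m tall
example = bmi(70, 1.75)
print(round(example, 2), in_ideal_range(example))
```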

## iii) Used libraries
Here are the libraries we will use:
* 1) NumPy (for linear algebra)
* 2) Pandas (for data manipulation)
* 3) Matplotlib (for data visualization)
* 4) Seaborn (for data visualization)


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns; sns.set()

# Disabling warnings
import warnings
warnings.simplefilter("ignore")

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## iv) Loading and Copying the Data

In [None]:
data=pd.read_csv("/kaggle/input/insurance/insurance.csv")

In [None]:
df=data.copy()

## v) Exploratory Data Analysis
We will look at the first and last rows of the data.

In [None]:
display(df.head())
display(df.tail())

In [None]:
df.info()

In [None]:
sex = df.groupby('sex').size()
print(sex)
smoker = df.groupby('smoker').size()
print(smoker)
region = df.groupby('region').size()
print(region)
# The data is well balanced across sex and region.
# On the other hand, non-smokers outnumber smokers.

## vi) Print a table containing statistical data of the dataset

In [None]:
df.describe()

In [None]:
df.isnull().sum()  # There are no NaN values.

## vii) Check for NaN values and get to know the columns of the data

In [None]:
region_list=list(df['region'].unique())
region_list  # There are 4 regions in the data.

In [None]:
children_list=list(df['children'].unique())
children_list  # There are 6 different values for the number of children.

# VISUALIZATION STEPS:

## STEP-1: Correlation of Columns (Attributes) and Heatmap
* In this section, we compute the correlation between the columns and visualize it as a heatmap. This lets us see the relationships between the attributes.

In [None]:
df.select_dtypes(include="number").corr()  # Correlation for the numerical columns only; newer pandas versions raise an error on text columns.


* Correlation is a number between -1.0 and 1.0 that indicates how strongly two attributes are linearly related. As it approaches 1.0, the relationship strengthens in the positive direction (both values rise together). As it approaches -1.0, it strengthens in the opposite direction (one value rises as the other falls). If the value is close to zero, the linear relationship between the two attributes is weak.

* For example, in the matrix above we see a modest relationship between a person's age and their charges. The other relationships are weaker. Now we visualize this correlation matrix with a heatmap:

In [None]:
f,ax = plt.subplots(figsize=(10, 10))
sns.heatmap(df.select_dtypes(include="number").corr(), annot=True, linewidths=0.5, linecolor="red", fmt='.2f', ax=ax)
plt.show()

Graphic Interpretation:
* We can see that the correlations are positive but weak. The highest correlation, between medical charges and age, is about 0.3, which is not large either.
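
To make the correlation coefficient used in this step concrete, here is a minimal sketch that computes the Pearson correlation by hand with NumPy. The helper name `pearson_r` and the toy values are assumptions made for illustration; they do not come from the insurance data.

```python
import numpy as np


def pearson_r(x, y) -> float:
    """Pearson correlation: sum of co-deviations divided by the root of the product of squared deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float((x_centered * y_centered).sum() / denominator)


# Made-up toy values, not taken from the insurance dataset
ages = [20, 30, 40, 50, 60]
rising = [1.0, 2.0, 3.0, 4.0, 5.0]    # moves in the same direction as age
falling = [5.0, 4.0, 3.0, 2.0, 1.0]   # moves in the opposite direction

print(pearson_r(ages, rising))   # perfectly linear, positive direction
print(pearson_r(ages, falling))  # perfectly linear, negative direction
```

The result should agree with `np.corrcoef(x, y)[0, 1]`, which is a convenient cross-check.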

## STEP-2: The distribution of smokers and non-smokers in the BMI and Charges Scatter Plot
* If we examine how smokers are distributed in the relationship between body mass index and insurance charges, a different picture emerges.

In [None]:
sns.scatterplot(x="bmi", y="charges", data=df) 
# A first look at the relationship between BMI and charges.

In [None]:
sns.set(style = "ticks")
sns.pairplot(df, hue = "smoker")

In [None]:
sns.scatterplot(x="bmi", y="charges", data=df, hue='smoker')

In [None]:
sns.lmplot(x = "bmi", y = "charges", hue="smoker",data = df);

In [None]:
df.groupby('smoker')[['charges','bmi']].corr() 
# We see that the correlation is higher for smokers.

Graphic Interpretation:
* When we divide beneficiaries into smokers and non-smokers and recalculate the correlation between bmi and charges within each group, the smokers' group comes out at about 0.80. This is a strong correlation, well above 0.5.
* charges vs bmi: a BMI greater than 30 is considered obese. The chart shows a group of individuals with BMI > 30 who are charged more.
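
The effect described above, where a correlation is much stronger inside one group than another, can be sketched on a tiny example. The helper `corr_by_group` and the toy frame are hypothetical; the numbers are made up and are not rows from the insurance data.

```python
import pandas as pd


def corr_by_group(frame: pd.DataFrame, group_col: str, x: str, y: str) -> pd.Series:
    """Pearson correlation between columns x and y, computed separately for each group."""
    return pd.Series({name: group[x].corr(group[y])
                      for name, group in frame.groupby(group_col)})


# Made-up toy data: charges climb with bmi for smokers but stay flat for non-smokers
toy = pd.DataFrame({
    "smoker":  ["yes", "yes", "yes", "yes", "no", "no", "no", "no"],
    "bmi":     [22.0, 27.0, 32.0, 37.0, 22.0, 27.0, 32.0, 37.0],
    "charges": [15000.0, 22000.0, 36000.0, 45000.0, 5000.0, 4000.0, 6000.0, 5000.0],
})

by_group = corr_by_group(toy, "smoker", "bmi", "charges")
print(by_group)
print(toy["bmi"].corr(toy["charges"]))  # correlation with both groups pooled together
```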

## STEP-3: Examining the Relationship of Charges to the Categorical Features

Let us first examine the distribution of charges.

In [None]:
# Check the distribution of charges.
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable plot.
sns.histplot(df['charges'], kde=True)
plt.title("Distribution of Charges")
plt.show()

Graphic Interpretation:
* The distribution is skewed to the right. We can tell visually that there may be outliers (the maximum charge is $63,770). Let us examine the charges again, this time between groups.
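
The visual impression of right skew and possible outliers can also be checked numerically. The sketch below uses pandas' `Series.skew()` and the common 1.5 x IQR rule; the helper name `iqr_outliers`, the choice of that rule, and the toy values are assumptions for illustration, not something the analysis above prescribes.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]


# Made-up toy charges with one very large value
toy_charges = pd.Series([2000.0, 3000.0, 4000.0, 5000.0, 6000.0, 7000.0, 60000.0])

print(toy_charges.skew())         # a positive value indicates a longer right tail
print(iqr_outliers(toy_charges))  # entries flagged by the rule
```

On the notebook's data, the same two calls could be applied to `df['charges']`.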

In [None]:
# 1) Charges by gender
meanGender = df.groupby(by = "sex")["charges"].mean()
print(meanGender)
sns.violinplot(x = "sex", y = "charges", data = df);

Graphic Interpretation:
* The violin plot shows little difference between genders. For males, the average charge is slightly higher than for females, by around $1,387.

In [None]:
# 2) Charges for smokers and non-smokers
meanSmoker = df.groupby(by = "smoker")["charges"].mean()
print(meanSmoker)
print(meanSmoker["yes"] - meanSmoker["no"])  # difference between the two group means
sns.violinplot(x = "smoker", y = "charges", data = df);

Graphic Interpretation:

* There is a difference of around $23,615 in average charges between smokers and non-smokers. Smoking is very expensive indeed.

In [None]:
# 3) Charges among regions
meanRegion = df.groupby(by = "region")["charges"].mean()
print(meanRegion)
sns.violinplot(x = "region", y = "charges", data = df);

In [None]:
# Total charges per region, used for both the slice sizes and the labels.
# Selecting 'charges' before aggregating avoids errors on the text columns in newer pandas versions.
sizes = df.groupby('region')['charges'].sum()
labels = sizes.index
colors = ['grey','blue','red','yellow']
explode = [0.02,0.02,0.2,0.02]
plt.figure(figsize = (7,7))
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.2f%%')
plt.title('Region - Charges',color = 'blue',fontsize = 15);

Graphic Interpretation: 
* The regions also do not differ much based on the plots. Individuals from the Southeast were charged the most on their bills. The highest-charged individual also lives in that region, as shown in the chart.

In [None]:
# 4) The following shows the relationship of the medical charges to the other
# numerical variables.
pairPlot = sns.pairplot(df)

In [None]:
sns.set(style = "ticks")
sns.pairplot(df, hue = "smoker")

* Focusing again on the first 3 charts in the bottom row, we can see that the higher charges are dominated by the blue points, which represent smokers.

## STEP-4: Number of people paying x amount for each charges category

In [None]:
#Creating another column containing bins of charges
df['charges_bins'] = pd.cut(df['charges'], bins=[0, 15000, 30000, 45000, 60000, 75000])

df.head()

In [None]:
# Creating a countplot based on the amount of charges
plt.figure(figsize=(12,4))
sns.countplot(x='charges_bins', data=df)
plt.title('Number of people paying x amount\n for each charges category', size=23)
plt.xticks(rotation=25)  # numeric angle; recent matplotlib versions reject the string '25'
plt.ylabel('Count',size=18)
plt.xlabel('Charges',size=18)
plt.show()

Graphic Interpretation:
* As we can see, most people pay less than 15k in medical costs.
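
The count plot can be complemented with the share of people in each bin. The sketch below reuses the same bin edges as the `charges_bins` column; the helper name `share_per_bin` and the toy values are hypothetical and only illustrate the pattern.

```python
import pandas as pd


def share_per_bin(values: pd.Series, edges: list) -> pd.Series:
    """Fraction of values falling into each (left, right] bin, ordered from lowest to highest bin."""
    binned = pd.cut(values, bins=edges)
    return binned.value_counts(normalize=True).sort_index()


# Made-up toy charges, binned with the same edges used for charges_bins
toy_charges = pd.Series([1000.0, 5000.0, 12000.0, 20000.0, 40000.0])
shares = share_per_bin(toy_charges, [0, 15000, 30000, 45000, 60000, 75000])
print(shares)
```

In the notebook itself, `share_per_bin(df['charges'], [0, 15000, 30000, 45000, 60000, 75000])` would give the fraction behind each bar.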

## STEP-5: Relation between age and charges

In [None]:
# Making bins for the ages
df['age_bins'] = pd.cut(df['age'], bins = [0, 20, 35, 50, 70])

# Creating boxplots of charges for the different age categories
plt.figure(figsize=(12,4))
sns.boxplot(x='age_bins', y='charges', data=df)
plt.title('Charges according to age categories', size=23)
plt.xticks(rotation=25)  # numeric angle; recent matplotlib versions reject the string '25'
plt.grid(True)
plt.ylabel('Charges',size=18)
plt.xlabel('Age',size=18)
plt.show()

Graphic Interpretation:
* We can see that older people pay more in health charges. Individuals between 50 and 70 years old pay the most.

## STEP-6: Relation between number of children and charges

In [None]:
# Countplot for the different 'number of children' categories
plt.figure(figsize=(12,4))
sns.countplot(x='children', data=df)
plt.title('Number of people having x children', size=23)
plt.ylabel('Count',size=18)
plt.xlabel('Number of children',size=18)
plt.show()

Graphic Interpretation:
* Most of the individuals don't have any children. Each time the number of children increases by one, the count of individuals decreases. Maybe having many children isn't a trend nowadays!

In [None]:
# Charges according to number of children
# Creating a violinplot for each category, split by sex
plt.figure(figsize=(12,4))
sns.violinplot(x='children', y='charges', data=df, hue='sex')
plt.title('Charges according to number of children', size=23)
plt.ylabel('Charges',size=18)
plt.xlabel('Number of children',size=18)
plt.show()

Graphic Interpretation:
* As we can see, almost all categories have a similar range and mean of costs, and the distributions are very similar, except for people who have 5 children. This might be because of the small sample size for that group.
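
One way to check the sample-size explanation is to print the group size next to the mean and median of each group. The sketch below does that with `groupby(...).agg(...)`; the helper name `charges_summary` and the toy rows are hypothetical and made up for illustration.

```python
import pandas as pd


def charges_summary(frame: pd.DataFrame, by: str) -> pd.DataFrame:
    """Group size, mean and median of the 'charges' column for each value of column `by`."""
    return frame.groupby(by)["charges"].agg(["count", "mean", "median"])


# Made-up toy rows: the group with 5 children has a single member
toy = pd.DataFrame({
    "children": [0, 0, 0, 1, 1, 5],
    "charges":  [4000.0, 6000.0, 8000.0, 5000.0, 7000.0, 9000.0],
})

summary = charges_summary(toy, "children")
print(summary)
```

Calling `charges_summary(df, 'children')` on the notebook's data would show how many individuals stand behind each violin.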


## STEP-7: Relation between age and bmi

In [None]:
sns.lineplot(x="age", y='bmi',data = df);

In [None]:
plt.figure(figsize=(10,8))
sns.scatterplot(x='age',y='bmi', data=df);

In [None]:
sns.lmplot(x = "age", y = "bmi",data = df);

## STEP-8: Relation between sex and bmi

In [None]:
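# catplot is a figure-level function and creates its own figure, so a separate plt.figure call would only add an empty figure.
sns.catplot(x='sex', y="bmi", kind="box", data=df);
plt.xticks(rotation=30);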

Graphic Interpretation:
* Males and females have almost the same BMI. The male BMI is slightly higher.

## STEP-9: Relation between children and bmi

In [None]:
# catplot creates its own figure, so no plt.figure call is needed here.
sns.catplot(x='children', y="bmi", kind="box", hue="sex", data=df);

## Conclusions:

After analysing the relations between the different variables and the 'charges' variable, we got a different result for each feature:

* age: this variable has an impact on the charges; the older a person is, the larger the health costs.
* sex: the 'sex' variable doesn't affect the charges; whether you are a man or a woman, your health bills won't change.
* bmi: after grouping beneficiaries into different classes (by smoking status), we found that as BMI increases, the health care charges increase as well.
* children: the number of children doesn't affect the medical costs billed by health insurance.
* smoker: if you are a smoker, you should expect much larger medical charges than non-smokers. Especially for people with high BMI values (>35), this results in very large health care charges.
* region: no matter where you live, this won't have any impact on your medical insurance bills.