# EX 5: Medical Insurance Prediction

**Objective**: Learn more about Linear Regression and its different training techniques.
    
You will be using a medical insurance dataset.
Below is some information to help **frame the problem**:

For a health insurance company to make money, it needs to collect more in yearly premiums than it spends on medical care for its beneficiaries.

As a result, insurers invest a great deal of time and money in developing models that accurately forecast medical expenses for the insured population.

Medical expenses are difficult to estimate because the costliest conditions are rare and seemingly random.

Still, some conditions are more prevalent for certain segments of the population. For instance, lung cancer is more likely among smokers than non-smokers, and heart disease may be more likely among the obese.

The goal of this analysis is to use patient data to estimate the average medical care expenses for such population segments.

These estimates can be used to create actuarial tables that set the price of yearly premiums higher or lower, depending on the expected treatment costs.


## Collecting data

For this analysis, we will use a simulated dataset containing hypothetical medical expenses for patients in the United States.

This data was created using demographic statistics from the US Census Bureau, and approximately reflects real-world conditions.

The data includes **1,338 examples** of beneficiaries currently enrolled in the insurance plan, with features indicating characteristics of the patient as well as the total medical expenses charged to the plan for the calendar year. There are 6 features:

- age: An integer indicating the age of the primary beneficiary (excluding those above 64 years, since they are generally covered by the government).
- sex: The policy holder’s gender, either male or female.
- bmi: The body mass index (BMI), which provides a sense of how over- or under-weight a person is relative to their height. BMI is equal to weight (in kilograms) divided by height (in meters) squared. An ideal BMI is within the range of 18.5 to 24.9.
- children: An integer indicating the number of children/dependents covered by the insurance plan.
- smoker: A yes or no categorical variable that indicates whether the insured regularly smokes tobacco.
- region: The beneficiary’s place of residence in the US, divided into four geographic regions: northeast, southeast, southwest, or northwest.

It is important to give some thought to how these variables may be related to billed medical expenses.

For instance, we might expect that older people and smokers are at higher risk of large medical expenses. Unlike many other machine learning methods, in regression analysis, the relationships among the features are typically specified by the user rather than being detected automatically.

Let’s explore some of these potential relationships.


## Exploring and preparing the data

1. Load the data for analysis.
2. Show the structure of the data.
3. What feature is the dependent feature and which ones are the independent features? What is the type of each feature in the data?
4. How would you analyze the dependent feature, and why use this method?
5. Give two methods for checking whether the dependent variable is normally distributed. Describe the distribution of the dependent feature.
6. Which of the independent features in the data are categorical? Look into the distribution of each one. Explain the results.
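
As a hedged illustration of questions 4 to 6, the sketch below runs a few common checks on a synthetic, right-skewed column. The data frame `df`, its values, and the choice of skewness plus the Shapiro-Wilk test are assumptions made for illustration; they are not the real insurance data, and other methods (a histogram or a Q-Q plot, for example) are equally valid.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the real data (assumption): right-skewed "charges".
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "charges": rng.lognormal(mean=9.0, sigma=1.0, size=500),
    "smoker": rng.choice(["yes", "no"], size=500, p=[0.2, 0.8]),
})

# Summary statistics: a mean well above the median hints at right skew.
print(df["charges"].describe())

# Check 1: sample skewness (close to 0 for a symmetric distribution).
skewness = stats.skew(df["charges"])

# Check 2: Shapiro-Wilk test (a small p-value is evidence against normality).
shapiro_stat, shapiro_p = stats.shapiro(df["charges"])
print(f"skewness = {skewness:.2f}, Shapiro-Wilk p-value = {shapiro_p:.3g}")

# Distribution of a categorical feature, expressed as proportions.
smoker_share = df["smoker"].value_counts(normalize=True)
print(smoker_share)
```

On the real data, the same calls would be applied to the loaded data frame instead of the synthetic `df`.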


## Exploring relationships among features

Before fitting a regression model to data, it can be useful to determine how the independent variables are related to the dependent variable and each other.

A correlation matrix provides a quick overview of these relationships. Given a set of variables, it provides a correlation for each pairwise relationship.

   7. Create a correlation matrix for the four numeric variables in the insurance data frame.
   
   8. Are there any interesting relationships between features? Explain those associations.
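
A minimal sketch of a pairwise correlation matrix, assuming a pandas data frame with the four numeric columns. The frame `toy` and the coefficients used to generate it are invented for illustration and do not reflect the real data.

```python
import numpy as np
import pandas as pd

# Synthetic numeric features (assumption: invented values, not the real data).
rng = np.random.default_rng(1)
n = 300
age = rng.integers(18, 65, size=n)
bmi = rng.normal(30, 6, size=n)
children = rng.integers(0, 6, size=n)
charges = 250 * age + 300 * bmi + rng.normal(0, 3000, size=n)
toy = pd.DataFrame(
    {"age": age, "bmi": bmi, "children": children, "charges": charges}
)

# DataFrame.corr() returns Pearson correlations for every pair of columns.
corr = toy[["age", "bmi", "children", "charges"]].corr()
print(corr.round(2))
```

The matrix is symmetric with ones on the diagonal, so only the entries above (or below) the diagonal need to be read.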


## Visualizing relationships among features

It can also be helpful to visualize the relationships among numeric features by using a scatterplot.

Although we could create a scatterplot for each possible relationship, doing so for a large number of features quickly becomes unwieldy.

An alternative is to create a scatterplot matrix, which is simply a collection of scatterplots arranged in a grid.

It is used to detect patterns among three or more variables. The scatterplot matrix is not a true multidimensional visualization because only two features are examined at a time.

Still, it provides a general sense of how the data may be interrelated.

9. Use the scatter_matrix() function to create a scatterplot matrix for the four numeric features: age, bmi, children, and charges.

10. Do you notice any patterns in these plots? Explain.

11. To enhance the previous scatterplot matrix, create a new one using the seaborn or plotly libraries. Do you notice any new patterns in these new graphics?
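
The scatterplot matrix described above can be sketched as follows, assuming `scatter_matrix` is imported from `pandas.plotting`. The frame `toy` is synthetic and only stands in for the insurance data. In seaborn, `pairplot` with its `hue` argument is one possible route for the enhanced version, though that is a suggestion rather than the required approach.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic numeric features (assumption: invented values, not the real data).
rng = np.random.default_rng(1)
n = 300
toy = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n),
    "children": rng.integers(0, 6, size=n),
})
toy["charges"] = 250 * toy["age"] + 300 * toy["bmi"] + rng.normal(0, 3000, size=n)

# One scatterplot per pair of columns, with histograms on the diagonal.
axes = scatter_matrix(toy, figsize=(8, 8), diagonal="hist", alpha=0.5)
print(axes.shape)
```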

## Dummy coding

To be able to use all variables in our regression model, we need to transform the categorical variables into numerical ones.
One way is to transform every categorical variable into dummy variables.
Dummy coding is a commonly used method for converting a categorical variable into numeric indicator variables.
A dummy variable, as the name suggests, is a stand-in variable that represents one level of a categorical variable.
The presence of a level is represented by 1 and its absence by 0.
For every level present, one dummy variable is created.
The line below shows how to convert the categorical variables using dummy variables.

`data = pd.get_dummies(df, dtype=int)`
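
To make the encoding concrete, here is a small sketch on a three-row toy frame (the values are invented for illustration). It also shows the optional `drop_first=True` argument of `pd.get_dummies`, which keeps one fewer dummy per variable; whether to use it in this exercise is left as a modeling choice.

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns (invented values).
toy = pd.DataFrame({
    "age": [19, 33, 45],
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
})

# One 0/1 column per level; numeric columns are passed through unchanged.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)

# Variant: drop the first level of each variable to use it as the baseline.
encoded_ref = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded_ref)
```

Each categorical column with k levels becomes k dummy columns in the first call and k - 1 in the second.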

# Training a model on the data

To fit a linear regression model to data with Python, the OLS class from the statsmodels package can be used. It is available through statsmodels.api, which you will need to import as sm. Use the OLS() syntax to fit a linear regression model to this data (use all the available features).


## 1. statsmodels.api

Let's use the statsmodels package.

12. Use the OLS() syntax provided here: https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html, to fit a linear regression model to this data (use all the available features). Name it “model”.

13. Show the estimated beta coefficients. Explain the results. Why are there 11 coefficients reported in addition to the intercept, rather than 6?

14. By how much do the predicted medical expenses increase or decrease for each additional child? For each additional year of age? For each unit increase in BMI?

15. Quantify the effect of each of the 3 categorical features on medical expenses.

16. To evaluate the model performance, use the summary() method on the stored model.

17. Given the preceding two performance indicators, how is the model performing? Explain.

## 2. Scikit learn


Let's use the scikit-learn library.


18. Split the data between training and testing.


19. Use scikit-learn's **LinearRegression** class, from sklearn.linear_model, which solves the least-squares problem directly (the closed-form approach associated with the Normal equation), to find the weights that minimize the MSE. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

20. Use scikit-learn's **SGDRegressor** class, from sklearn.linear_model, which is based on stochastic gradient descent, to find the weights that minimize the MSE. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#

21. Use SGDRegressor, from sklearn.linear_model, to implement a simulation of mini-batch gradient descent.


22. What are the key hyperparameters of each implementation?

23. Find different metrics to evaluate your regression model performance.
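
The steps above can be sketched as follows on synthetic data (an assumption: two standard-normal features with known weights, not the insurance data). The mini-batch part feeds shuffled chunks to `partial_fit`; note that SGDRegressor still updates the weights one sample at a time inside each chunk, so this is only a simulation of mini-batch gradient descent. The batch size and epoch count are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem with known weights (invented for illustration).
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Gradient-based solvers are sensitive to feature scale, so standardize first.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Direct least-squares solution.
lin = LinearRegression().fit(X_train_s, y_train)

# Stochastic gradient descent: one sample per weight update.
sgd = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42)
sgd.fit(X_train_s, y_train)

# Simulated mini-batch training: shuffle each epoch, then feed chunks of 32 rows.
mb = SGDRegressor(random_state=42)
batch_size = 32
for epoch in range(50):
    order = rng.permutation(len(X_train_s))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        mb.partial_fit(X_train_s[idx], y_train[idx])

# Compare the three fitted models on the held-out test set.
scores = {}
for name, est in [("least squares", lin), ("SGD", sgd), ("mini-batch", mb)]:
    pred = est.predict(X_test_s)
    scores[name] = {
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "mae": float(mean_absolute_error(y_test, pred)),
        "r2": float(r2_score(y_test, pred)),
    }
    print(name, scores[name])
```

On a well-conditioned problem like this one, the three approaches should reach similar test scores; on the real data, the SGD-based variants depend more heavily on scaling and on settings such as `eta0`, `learning_rate`, `max_iter`, and `tol`.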