### 📝 **Instructions to build the linear regression model in Python**
#### US county-level sociodemographic and health resource data (2018-2019)
Sociodemographic and health resource data have been collected by county in the United States, and we want to find out whether there is any relationship between health resources and sociodemographic data.

To do this, you need to set a target variable (health-related) to conduct the analysis.

#### **Step 0: Import Libraries**

In [1]:
# Libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# When working locally, reading the CSV directly from the URL may fail with an SSL certificate error
# In that case, download the file with requests and parse it through StringIO
import requests
from io import StringIO

from sklearn.preprocessing import (MinMaxScaler,
                                   StandardScaler,
                                   LabelEncoder,
                                   OneHotEncoder)
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import (chi2,
                                       SelectKBest,
                                       f_regression)
from sklearn.model_selection import (train_test_split,
                                     GridSearchCV) # For Optimize
from sklearn.linear_model import (LinearRegression,
                                  Lasso)
from sklearn.metrics import (mean_squared_error,
                             r2_score)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Save the trained model
from pickle import dump

#### **Step 1: Loading the CSV into a pandas DataFrame**
The dataset can be found in this project folder under the name `demographic_health_data.csv`. You can load it into the code directly from the link:

```text
https://raw.githubusercontent.com/4GeeksAcademy/regularized-linear-regression-project-tutorial/main/demographic_health_data.csv
```

Alternatively, download it and add it to your repository by hand. The dataset contains a large number of variables, which are defined [here](https://raw.githubusercontent.com/4GeeksAcademy/regularized-linear-regression-project-tutorial/main/data_dict.csv).
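If reading the URL directly with `pd.read_csv` fails locally with an SSL certificate error, one possible workaround is to download the file with `requests` and parse the text through `StringIO`. The sketch below assumes that approach; the helper names `csv_text_to_frame` and `load_csv_with_requests` are hypothetical and not part of the project.

```python
from io import StringIO

import pandas as pd
import requests

URL = ("https://raw.githubusercontent.com/4GeeksAcademy/"
       "regularized-linear-regression-project-tutorial/main/demographic_health_data.csv")


def csv_text_to_frame(text):
    """Parse CSV content that is already held in memory as a string."""
    return pd.read_csv(StringIO(text))


def load_csv_with_requests(url, timeout=30):
    """Download a CSV file with requests and return it as a DataFrame (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
    return csv_text_to_frame(response.text)


# Usage (requires network access):
# df = load_csv_with_requests(URL)
```

Splitting the download from the parsing keeps the parsing step easy to check with a small in-memory string.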

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/regularized-linear-regression-project-tutorial/main/demographic_health_data.csv")
df.head()

| | fips | TOT_POP | 0-9 | 0-9 y/o % of total pop | 19-Oct | 10-19 y/o % of total pop | 20-29 | 20-29 y/o % of total pop | 30-39 | 30-39 y/o % of total pop | ... | COPD_number | diabetes_prevalence | diabetes_Lower 95% CI | diabetes_Upper 95% CI | diabetes_number | CKD_prevalence | CKD_Lower 95% CI | CKD_Upper 95% CI | CKD_number | Urban_rural_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | 55601 | 6787 | 12.206615 | 7637 | 13.735364 | 6878 | 12.370281 | 7089 | 12.749771 | ... | 3644 | 12.9 | 11.9 | 13.8 | 5462 | 3.1 | 2.9 | 3.3 | 1326 | 3 |
| 1 | 1003 | 218022 | 24757 | 11.355276 | 26913 | 12.344167 | 23579 | 10.814964 | 25213 | 11.564429 | ... | 14692 | 12.0 | 11.0 | 13.1 | 20520 | 3.2 | 3.0 | 3.5 | 5479 | 4 |
| 2 | 1005 | 24881 | 2732 | 10.980266 | 2960 | 11.896628 | 3268 | 13.13452 | 3201 | 12.865239 | ... | 2373 | 19.7 | 18.6 | 20.6 | 3870 | 4.5 | 4.2 | 4.8 | 887 | 6 |
| 3 | 1007 | 22400 | 2456 | 10.964286 | 2596 | 11.589286 | 3029 | 13.522321 | 3113 | 13.897321 | ... | 1789 | 14.1 | 13.2 | 14.9 | 2511 | 3.3 | 3.1 | 3.6 | 595 | 2 |
| 4 | 1009 | 57840 | 7095 | 12.266598 | 7570 | 13.087828 | 6742 | 11.656293 | 6884 | 11.901798 | ... | 4661 | 13.5 | 12.6 | 14.5 | 6017 | 3.4 | 3.2 | 3.7 | 1507 | 2 |


In [17]:
df['Heart disease_number']

0        3345
1       13414
2        2159
3        1533
4        4101
        ...  
3135     1862
3136      981
3137     1034
3138      500
3139      471
Name: Heart disease_number, Length: 3140, dtype: int64

In [3]:
# Save a raw copy of the dataset in ../data/raw
df_raw = df.copy()
df_raw.to_csv("../data/raw/df_raw_RLR.csv", index= False)

In [9]:
display(df_raw.info())
display(df_raw.columns)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3140 entries, 0 to 3139
Columns: 108 entries, fips to Urban_rural_code
dtypes: float64(61), int64(45), object(2)
memory usage: 2.6+ MB


None

Index(['fips', 'TOT_POP', '0-9', '0-9 y/o % of total pop', '19-Oct',
       '10-19 y/o % of total pop', '20-29', '20-29 y/o % of total pop',
       '30-39', '30-39 y/o % of total pop',
       ...
       'COPD_number', 'diabetes_prevalence', 'diabetes_Lower 95% CI',
       'diabetes_Upper 95% CI', 'diabetes_number', 'CKD_prevalence',
       'CKD_Lower 95% CI', 'CKD_Upper 95% CI', 'CKD_number',
       'Urban_rural_code'],
      dtype='object', length=108)

#### **Select variable for analysis**
While exploring the dataset, we identified five variables that could serve as the target. We chose `Heart disease_number` (renamed to `target` in the code below) because studies indicate that heart disease is among the leading causes of mortality in many regions.

#### **Step 2: Perform a full EDA**
This second step is vital: it ensures that we keep only the variables that are strictly necessary and drop those that are irrelevant or carry no information. Use the example Notebook we worked on and adapt it to this use case.

Be sure to split the dataset into train and test sets, as we have seen in previous lessons.
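A minimal sketch of the train/test split follows. It uses a small synthetic DataFrame as a stand-in for the project data; the column names `f1` to `f4`, the 80/20 ratio, and `random_state=42` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the project data: numeric features plus a 'target' column
rng = np.random.default_rng(42)
demo = pd.DataFrame(rng.normal(size=(100, 4)), columns=["f1", "f2", "f3", "f4"])
demo["target"] = 3.0 * demo["f1"] - 2.0 * demo["f2"] + rng.normal(scale=0.1, size=100)

# Separate the predictors from the target
X = demo.drop(columns="target")
y = demo["target"]

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

With the real data, `demo` would be replaced by the preprocessed DataFrame. Fitting any scaler or feature selector on the training portion only avoids leaking information from the test set.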

In [18]:
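# Step 2.1: Preprocessing data
# Normalize column names: spaces to underscores, lowercase, at most 40 characters
clean_columns = (
    df_raw.columns
          .str.replace(' ', '_')
          .str.lower()
          .str.slice(0, 40)
)
# Apply the new names, then drop duplicated rows and rebuild the index
df_interim = (
    df_raw
        .copy()
        .set_axis(clean_columns, axis=1)
        .drop_duplicates()
        .reset_index(drop=True)
)
df_interim.info()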

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3140 entries, 0 to 3139
Columns: 108 entries, fips to urban_rural_code
dtypes: float64(61), int64(45), object(2)
memory usage: 2.6+ MB


There are many columns, so it is recommended to scale them and select the variables with the greatest weight.

In [24]:
# Use the heart disease count as the target variable
df_interim.rename(columns= {'heart_disease_number': 'target'}, inplace= True)

# Select the numeric columns, excluding 'target'
data_types = df_interim.dtypes
num_columns = [c for c in list(data_types[data_types != "object"].index) if c != 'target']

# Apply StandardScaler only to the numeric feature columns
scaler = StandardScaler()
norm_features = scaler.fit_transform(df_interim[num_columns])

# Create a new DataFrame with the scaled numeric variables
df_interim_scal = pd.DataFrame(norm_features, index=df_interim.index, columns=num_columns)

# Add the unscaled target column
df_interim_scal['target'] = df_interim['target']

# Move 'target' to the first position
df_interim_scal = df_interim_scal[['target'] + [col for col in df_interim_scal.columns if col != 'target']]

df_interim_scal.head()


| | target | fips | tot_pop | 0-9 | 0-9_y/o_%_of_total_pop | 19-oct | 10-19_y/o_%_of_total_pop | 20-29 | 20-29_y/o_%_of_total_pop | 30-39 | ... | copd_number | diabetes_prevalence | diabetes_lower_95%_ci | diabetes_upper_95%_ci | diabetes_number | ckd_prevalence | ckd_lower_95%_ci | ckd_upper_95%_ci | ckd_number | urban_rural_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3345 | -1.940874 | -0.145679 | -0.142421 | 0.158006 | -0.135556 | 0.573496 | -0.153144 | 0.02761 | -0.139384 | ... | -0.1389 | -0.063696 | -0.07172 | -0.089834 | -0.129902 | -0.609615 | -0.582796 | -0.669652 | -0.147523 | -1.082865 |
| 1 | 13414 | -1.940742 | 0.341296 | 0.287476 | -0.242861 | 0.320383 | -0.193107 | 0.183774 | -0.469965 | 0.23062 | ... | 0.563986 | -0.394103 | -0.4149 | -0.337677 | 0.376251 | -0.433549 | -0.393279 | -0.343373 | 0.389791 | -0.420704 |
| 2 | 2159 | -1.94061 | -0.237785 | -0.239429 | -0.419441 | -0.246181 | -0.439718 | -0.225971 | 0.272104 | -0.218759 | ... | -0.219763 | 2.432709 | 2.483064 | 2.317776 | -0.183415 | 1.855312 | 1.880929 | 1.777443 | -0.204321 | 0.903618 |
| 3 | 1533 | -1.940478 | -0.245223 | -0.246032 | -0.426966 | -0.254791 | -0.609076 | -0.230792 | 0.396168 | -0.220555 | ... | -0.256918 | 0.376846 | 0.423984 | 0.299632 | -0.229096 | -0.257483 | -0.203761 | -0.180233 | -0.2421 | -1.745026 |
| 4 | 4101 | -1.940346 | -0.138966 | -0.135053 | 0.186249 | -0.13714 | 0.216679 | -0.155888 | -0.200808 | -0.14357 | ... | -0.074198 | 0.156575 | 0.195197 | 0.158008 | -0.111247 | -0.081417 | -0.014244 | -0.017093 | -0.124105 | -1.745026 |
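With this many columns, one way to pick the variables with the greatest weight is univariate selection using `SelectKBest` with `f_regression`, both of which are imported in Step 0. The sketch below runs on synthetic data; the feature names `feat_0` to `feat_5` and the choice `k=2` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical data: only feat_0 and feat_3 actually drive the target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(200, 6)),
    columns=[f"feat_{i}" for i in range(6)],
)
y_demo = 5.0 * X_demo["feat_0"] - 3.0 * X_demo["feat_3"] + rng.normal(scale=0.1, size=200)

# Score each feature with a univariate F-test and keep the k best
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X_demo, y_demo)

# get_support() returns a boolean mask over the original columns
selected_columns = X_demo.columns[selector.get_support()].tolist()
print(selected_columns)
```

On the real dataset, the selector would typically be fitted on the training split only, and `k` would be tuned rather than fixed in advance.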


In [None]:


# Create a new DataFrame with the scaled numerical variables
df_interim_scal = pd.DataFrame(norm_features, index= df_interim.index, columns= num_columns)
df_interim_scal['target'] = df_interim['target']
df_interim_scal.head()

#### **Step 3: Build a linear regression model in Python**
Start solving the problem by implementing a linear regression model and analyzing the results. Then, using the same data and default attributes, build a Lasso model and compare its results with the baseline linear regression.
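The comparison could be sketched as follows. The data here is synthetic (three informative features out of eight), so the scores say nothing about the county dataset; the metrics `r2_score` and `mean_squared_error` are an assumed choice for evaluating a regression.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data: 8 features, only 3 of them with a non-zero true coefficient
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
true_coef = np.array([4.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0])
y = X @ true_coef + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the baseline and a Lasso with default attributes (alpha=1.0) on the same split
results = {}
for name, model in [("linear", LinearRegression()), ("lasso", Lasso())]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "r2": r2_score(y_test, pred),
        "mse": mean_squared_error(y_test, pred),
    }
    print(f"{name}: R2={results[name]['r2']:.3f}  MSE={results[name]['mse']:.3f}")
```

Evaluating both models on the same held-out split keeps the comparison fair. Lasso's L1 penalty shrinks coefficients toward zero, so with the default `alpha` it can score lower than the unregularized baseline.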

Analyze how R^2 evolves as the `alpha` hyperparameter of the Lasso model changes (for example, start from a value of 0.0 and work your way up to 20). Plot these values in a line chart.
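One way to sketch this sweep is shown below, again on synthetic data. It assumes 21 evenly spaced `alpha` values between 0 and 20 and uses `LinearRegression` for `alpha = 0`, since Lasso with no penalty is ordinary least squares and scikit-learn warns against fitting `Lasso(alpha=0)`.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for the scaled county features
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 8))
y = X @ np.array([4.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

alphas = np.linspace(0.0, 20.0, 21)
r2_scores = []
for alpha in alphas:
    # alpha = 0 means no penalty, so plain least squares is used for that point
    model = LinearRegression() if alpha == 0 else Lasso(alpha=alpha, max_iter=10_000)
    model.fit(X_train, y_train)
    r2_scores.append(r2_score(y_test, model.predict(X_test)))

# Line chart of test-set R^2 against alpha
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(alphas, r2_scores, marker="o")
ax.set_xlabel("Lasso alpha")
ax.set_ylabel("R^2 on the test set")
ax.set_title("R^2 as a function of alpha")
```

As `alpha` grows, more coefficients are driven to zero, and the curve flattens once the model predicts little more than the mean of the target.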

#### **Step 4: Optimize the previous linear regression model using Python**
After training the Lasso model, if the results are not satisfactory, optimize it using one of the techniques seen above.
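A possible optimization is a cross-validated grid search over `alpha` with `GridSearchCV`, which is imported in Step 0. The sketch below assumes a logarithmic grid from 0.001 to 10, 5-fold cross-validation, and R^2 as the scoring metric; all three are illustrative choices, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical data standing in for the scaled county features
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 8))
y = X @ np.array([4.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate alpha values on a log scale (assumed range)
param_grid = {"alpha": np.logspace(-3, 1, 9)}

# 5-fold cross-validation on the training set, scored by R^2
search = GridSearchCV(Lasso(max_iter=10_000), param_grid, scoring="r2", cv=5)
search.fit(X_train, y_train)

print("Best alpha:", search.best_params_["alpha"])
print("Test R^2:", search.score(X_test, y_test))
```

Because the search only sees the training split, the final `search.score` on the test set remains an honest estimate of how the tuned model generalizes.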