In [22]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Diabetes in America

Diabetes is an enormous public health problem: it is estimated to affect more than 9% of U.S. adults.

The goal of this project is to build a machine learning model that estimates the risk of having diabetes based on a few simple variables (income, education, age, BMI, etc.).

## The BRFSS survey

The Behavioral Risk Factor Surveillance System (BRFSS) survey asks participants a range of questions about their health and health-related behaviours.
One of these questions asks whether the respondent has diabetes.

Navigate to the [Behavioral Risk Factor Surveillance System data portal](https://www.cdc.gov/brfss/annual_data/annual_Data.htm), and download the 2019 BRFSS Survey [Data](https://www.cdc.gov/brfss/annual_data/2019/files/LLCP2019ASC.zip).
Unzip the download to extract the file LLCP2019.ASC, and place it in your data folder.

The following cell will display the first row of the data file:

In [1]:
path = r'Data\LLCP2019.ASC'
with open(path) as file:
    first_line = file.readline()
    print(first_line)

01              0101182019     11002019000001                 11 121 0120001              2         31588881121112112222 222223  1121207                                    232        12127880301540502 22212213 073888      2                2101025553022012033152      412      213                                                                                                 134 42           22221131111                                                                                                                                                                                                                                                                           1001                                                                                                                                                                                                                                                                                                                                     

Each record in the BRFSS file is a string without delimiters to identify variables (i.e., the file format is *fixed-width*).
Variables are located at established positions in the string.
The [codebook](https://www.cdc.gov/brfss/annual_data/2019/pdf/codebook19_llcp-v2-508.HTML) describes the variables and field positions.

The following table lists the positions of several variables (positions are 1-based, and the end position is inclusive):

| Variable | Start | End |
| :- | :-: | :-: |
| Diabetes | 127 | 127 |
| General Health | 101 | 101 |
| Education Level | 174 | 174 |
| Employment Status | 188 | 188 |
| Income Level | 191 | 192 |
| Weight (in Pounds) | 193 | 196 |
| Height (ft/inches) | 197 | 200 |
| Frequency of Smoking | 209 | 209 |
| Alcohol Consumption | 217 | 218 |
| Body Mass Index (BMI) | 1998 | 2001 |

Don't forget that Python uses zero-indexing to reference characters in a string, so you will have to adjust the values in the table accordingly.
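
As a minimal sketch of that adjustment: a codebook range `(start, end)` is 1-based and inclusive, while a Python slice is 0-based and excludes its end index, so the equivalent slice is `[start - 1:end]`. The names `FIELD_POSITIONS`, `to_slice`, and `extract_fields` below are hypothetical helpers chosen for illustration, and the record is a short synthetic string rather than real BRFSS data.

```python
# Codebook positions (1-based, end inclusive) for a few variables
# from the table above. The dictionary name is illustrative only.
FIELD_POSITIONS = {
    "diabetes": (127, 127),
    "income": (191, 192),
    "weight": (193, 196),
}


def to_slice(start, end):
    """Convert a 1-based inclusive range into a Python slice.

    Only the start shifts by one; the end is unchanged because
    Python slices already exclude their end index.
    """
    return slice(start - 1, end)


def extract_fields(record, positions):
    """Return the raw (string) value of each field in one record."""
    return {
        name: record[to_slice(start, end)]
        for name, (start, end) in positions.items()
    }


# Synthetic record (NOT real survey data): "3" at position 127,
# "05" at positions 191-192, and "0180" at positions 193-196.
record = " " * 126 + "3" + " " * 63 + "05" + "0180" + " " * 4

fields = extract_fields(record, FIELD_POSITIONS)
print(fields)
```

The same conversion applies when using `pandas.read_fwf`, whose `colspecs` argument takes zero-based, half-open `(from, to)` pairs, so a codebook range `(start, end)` becomes `(start - 1, end)`.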

# Step 1: Create a DataFrame