In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Healthcare Analytics: Identifying At-Risk Individuals

Diabetes is a major public health problem: it is estimated to affect more than 10% of U.S. adults.

The Behavioral Risk Factor Surveillance System (BRFSS) survey asks participants a number of questions about health and health-related behaviours.
One of these questions asks whether the respondent has diabetes.

**Part 1:** Navigate to the [Behavioral Risk Factor Surveillance System data portal](https://www.cdc.gov/brfss/annual_data/annual_Data.htm), and download the 2021 BRFSS Survey [Data](https://www.cdc.gov/brfss/annual_data/2020/files/LLCP2020ASC.zip).
Unzip the file LLCP2020.ASC, and place it in your data folder.

The following cell displays the first record (row) of the data file.

In [1]:
path = 'LLCP2021.ASC_' # path to the file
with open(path) as file: # open file
    for line in file: # loop over document lines
        print(line) # print line
        break # stop the for loop

01              0101192021     11002021000001                 11 121 02 01012             2         52010880312223 21122211221223  1222108                                    141        111278805 007204112222211333888      10920200112      10155520420320110123                                                                                                              2         22222111111153                                                                                                                                                                                                                                                                      1001                                                                                                                                                                                                                                                                                                                                                     

Each record (row) in the BRFSS file is a string without delimiters to identify variables (i.e., the file format is *fixed-width*).
Variables are located at established positions in the string.
The [codebook](https://www.cdc.gov/brfss/annual_data/2020/pdf/codebook20_llcp-v2-508.pdf) describes the variables and field positions.

For example, according to the codebook (page 29), the answer to *(Ever told) you had diabetes?* is contained in column 126.

![diabetes](foto4.png)

For record 1, we have the following. Remember that Python strings are zero-indexed, so column 126 of the record is `line[125]`.

In [4]:
# (Ever told) you had diabetes? Yes
line[125]

'1'

As another example, according to the codebook (page 36), the *income level* is contained in columns 185-186.

![income](foto5.png)

For record 1, we have

In [6]:
# income level: less than $10,000
line[184:186]

'01'
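The conversion between codebook columns (1-based, inclusive) and Python slices (0-based, end-exclusive) is easy to get wrong by one. A minimal sketch of a helper that does the conversion in one place follows; the name `field` is hypothetical and not part of the original notebook, and the toy record stands in for a real BRFSS line.

```python
def field(record, start, end=None):
    """Return the raw text in codebook columns start..end (1-based, inclusive)."""
    if end is None:
        end = start  # single-column field
    return record[start - 1:end]

# Toy 10-character record standing in for a real BRFSS line
toy = "ABCDEFGHIJ"
print(field(toy, 3))     # column 3     -> C
print(field(toy, 4, 6))  # columns 4-6  -> DEF
```

With such a helper, `field(line, 126)` and `field(line, 185, 186)` would return the same values as `line[125]` and `line[184:186]` above.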

The **goal** of this problem is to build a `KNeighborsClassifier` model that estimates the risk of having diabetes based on a few simple variables.
Your model must use income, education, age, gender and BMI data, plus any other features you think are relevant.
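To make the modelling goal concrete, here is a minimal sketch of a k-nearest-neighbours risk estimate using scikit-learn. It runs on synthetic stand-in data, not on BRFSS records, and the choice of 15 neighbours and the 75/25 split are arbitrary assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT BRFSS): 500 rows, 3 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Label depends on the first feature plus noise, purely for illustration
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale features first: KNN distances are sensitive to feature units
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))
model.fit(X_train, y_train)

# Estimated risk = share of the 15 nearest training neighbours labelled 1
risk = model.predict_proba(X_test)[:, 1]
print(risk[:5])
print("accuracy:", model.score(X_test, y_test))
```

Scaling matters here because variables such as income category and BMI live on very different numeric ranges, and an unscaled distance would be dominated by the largest one.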

The following table lists the codebook positions (start and end columns) of several variables:

| Variable | Start | End |
| :- | :-: | :-: |
| Diabetes | 126 | 126 |
| General health | 101 | 101 |
| Education level | 168 | 168 |
| Employment status | 182 | 182 |
| Income level | 185 | 186 |
| Weight (in Pounds) | 188 | 191 |
| Height (in feet and inches) | 192 | 195 |
| Smoking Status | 2006 | 2006 | 
| Alcohol consumption (drinks per week) | 2013 | 2017 |
| Heavy drinkers | 2019 | 2019 |
| Body Mass Index (BMI) | 1997 | 2000 |
| Reported age (in five-year age categories) | 1980 | 1981 |
| Sex | 1979 | 1979 |
| Metropolitan Status | 1402 | 1402 |
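One way to keep these positions out of the parsing logic is to store them in a dictionary and slice every field in a single pass. The sketch below assumes the positions listed in the table above; the dictionary keys and the function `parse_record` are hypothetical names, only a few variables are included, and the record is synthetic.

```python
# Codebook positions (1-based, inclusive), copied from the table above.
# The keys are illustrative names, not official BRFSS variable names.
POSITIONS = {
    "diabetes": (126, 126),
    "education": (168, 168),
    "income": (185, 186),
    "sex": (1979, 1979),
    "age_group": (1980, 1981),
    "bmi": (1997, 2000),
}

def parse_record(record, positions=POSITIONS):
    """Slice one fixed-width record into a dict of raw (string) field values."""
    return {name: record[start - 1:end] for name, (start, end) in positions.items()}

# Synthetic record for illustration only: blanks with two fields filled in
chars = [" "] * 2100
chars[125] = "3"        # column 126
chars[184:186] = "05"   # columns 185-186
raw = parse_record("".join(chars))
print(raw["diabetes"], raw["income"])
```

The values returned are still raw strings; missing-value codes and blanks have to be handled before the data can be used in a model.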

**Part 2:** The code below reads the file line by line, extracts the diabetes and income data, and puts them into a pandas dataframe.
Your task is to modify the code so that it extracts all the data your model will need (income, education, age, BMI, etc.).

In [3]:
data_dict = {} # initialize dictionary
with open(path) as file: # open the file
    for idx, line in enumerate(file): # idx keeps track of the current line

        # extract income level data
        income = line[184:186] # columns 185-186
        if income in ['77', '99', '  ']: # don't know, refused or missing
            income = np.nan
        else:
            income = int(income)

        # extract diabetes data
        diabetes = line[125] # column 126
        if diabetes == '1': # yes
            diabetes = 1
        elif diabetes in ['2', '3', '4']: # pregnancy only, no, or pre-diabetes: coded as no
            diabetes = 0
        else: # don't know, refused or missing value
            diabetes = np.nan

        # store this record's values under its line number
        data_dict[idx] = {'income': income, 'diabetes': diabetes}

# put data into a dataframe
data = pd.DataFrame.from_dict(data_dict, orient='index')
data

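| | income | diabetes |
| :- | :-: | :-: |
| 0 | 1.0 | 1.0 |
| 1 | NaN | 0.0 |
| 2 | 7.0 | 0.0 |
| 3 | NaN | 0.0 |
| 4 | NaN | 0.0 |
| ... | ... | ... |
| 401953 | NaN | 0.0 |
| 401954 | 4.0 | 0.0 |
| 401955 | 1.0 | 0.0 |
| 401956 | NaN | 0.0 |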
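When extending the loop to further variables, the same "blank or special code means missing" pattern repeats for every field. A minimal sketch of two small decoders follows. The names `decode_code` and `decode_bmi` are hypothetical. The sketch also assumes that BMI is stored as four digits with two implied decimal places (so `2788` would mean 27.88); check this, and the missing-value codes of each variable, against the codebook before relying on it. `math.nan` is used here only to keep the example self-contained; `np.nan` behaves the same way.

```python
import math

def decode_code(raw, missing=()):
    """Convert a raw categorical field to int, or NaN if blank or in `missing`."""
    raw = raw.strip()
    if raw == "" or raw in missing:
        return math.nan
    return int(raw)

def decode_bmi(raw):
    """Convert a BMI field to float, ASSUMING two implied decimal places."""
    raw = raw.strip()
    if raw == "":
        return math.nan
    return int(raw) / 100

print(decode_code("07", missing=("77", "99")))  # valid income category
print(decode_code("77", missing=("77", "99")))  # "don't know" becomes NaN
print(decode_bmi("2788"))                       # assumed to mean 27.88
```

Passing the missing codes as an argument keeps one function usable for variables whose "don't know" and "refused" codes differ.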
