# Project: U.S. Medical Costs

In this project, we investigate smoking and medical insurance costs in the United States. We use data and directions provided by Codecademy; the original data can be found [here](https://www.kaggle.com/mirichoi0218/insurance). The analysis uses no visualization and no additional libraries other than ```csv```, which is needed to import the data. The purpose of this project is to cement basic Python skills: data processing, lists, libraries, functions/methods, and classes.

We will investigate the following questions:

- *Do smokers differ from nonsmokers?*
    - Their average age and age range?
    - Their average BMI?
    - Their average number of children?
    - The percentage of men and women?
- *What is the effect of smoking on insurance costs?*
    - What is the average annual insurance charge of people who smoke vs people who don't?
- *Are there regional differences in smoking incidence?*
    - Is the percentage of smokers higher in one region versus others?

In [1]:
import csv

First, we import the CSV using a helper function that stores all information in separate lists.

In [464]:
list_age = []
list_sex = []
list_bmi = []
list_children = []
list_smoker = []
list_region = []
list_charges = []

def get_data(filename, list_name, column):
    with open(filename, "r") as insurance:
        reader = csv.DictReader(insurance)
        for row in reader:
            list_name.append(row[column])

get_data("insurance.csv", list_age, "age")
get_data("insurance.csv", list_sex, "sex")
get_data("insurance.csv", list_bmi, "bmi")
get_data("insurance.csv", list_children, "children")
get_data("insurance.csv", list_smoker, "smoker")
get_data("insurance.csv", list_region, "region")
get_data("insurance.csv", list_charges, "charges")
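The helper above opens and parses the file once per column. As an alternative, a minimal sketch that fills all columns in a single pass could look like the following. The function name `read_columns` and the two inline sample rows are assumptions made for illustration; `io.StringIO` is used only so the example runs without the file on disk.

```python
import csv
import io

def read_columns(file_obj, columns):
    """Read a CSV once and return {column name: list of raw string values}."""
    data = {column: [] for column in columns}
    for row in csv.DictReader(file_obj):
        for column in columns:
            data[column].append(row[column])
    return data

# Hypothetical two-row sample in the same layout as insurance.csv.
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
)
columns = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]
data = read_columns(sample, columns)
print(data["age"])     # ['19', '18']
print(data["smoker"])  # ['yes', 'no']
```

With the real file, one would pass the handle from `open("insurance.csv", "r")` instead of `sample`. The values stay strings, as in the original helper, so numeric columns still need converting before any arithmetic.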

# Our data set
Our data has the following variables:
- ```age```: a person's age, in years.
- ```sex```: a person's sex. In this dataset, it is either ```'female'``` or ```'male'```.
- ```bmi```: a person's BMI (body mass index), given as a decimal number.
- ```children```: a person's number of children.
- ```smoker```: whether a person smokes or not. In this data set, it is either ```'yes'``` or ```'no'```.
- ```region```: the region where a person lives. In this data set, it is one of four regions: ```'northwest'```, ```'southwest'```, ```'northeast'``` and ```'southeast'```.
- ```charges```: a person's annual insurance charge, presented with several decimals.


# Data processing

To process the data, we create a ```Patient``` class and we store all the data in this class for easy access. 

The ```Patient``` class contains the following methods:

- ```get_average```: Takes a list of numbers (or numeric strings) and returns their average.
- ```analyze_sexes```: Counts the sexes found in the data and returns the number of people of each sex, in a dictionary.
- ```has_children```: Returns a tuple with the number of people who have at least one child, and the number of people who do not.
- ```unique_regions```: Returns a list of all unique regions found in the data.
- ```count_regions_dict```: Returns a dictionary with the unique regions in the data and the number of people in each region.
- ```create_dictionary```: Converts the data into a dictionary for further use.
- ```smoker_dictionary```: Returns a nested dictionary that contains separate data for smokers and nonsmokers.
- ```analyze_smoker```: Returns a dictionary with separate summary data for smokers and nonsmokers. 
- ```analyze_regions_smokers```: Returns a dictionary using each unique region as a key, and the number of smokers, the number of nonsmokers, and the percentage of smokers as values.

In [576]:
class Patient:

    def __init__(self, pt_age, pt_sex, pt_bmi, pt_children, pt_smoker, pt_region, pt_charges):
        self.age = pt_age
        self.sex = pt_sex
        self.bmi = pt_bmi
        self.children = pt_children
        self.smoker = pt_smoker
        self.region = pt_region
        self.charges = pt_charges
        
    def get_average(self, variable):
        total = 0
        for row in variable:
            total += float(row)
        return total / len(variable)
    
    def analyze_sexes(self):
        # Count each distinct value once; no inner loop over the rows is needed.
        return {key: self.sex.count(key) for key in set(self.sex)}
    
    def has_children(self):
        self.has_child_num = 0
        self.has_no_child_num = 0
        for entry in self.children:
            if float(entry) > 0:
                self.has_child_num += 1
            else:
                self.has_no_child_num += 1
        return self.has_child_num, self.has_no_child_num
    
    def unique_regions(self):
        self.unique_regions_list = []
        for region in self.region:
            # Keep each region once, in order of first appearance.
            if region not in self.unique_regions_list:
                self.unique_regions_list.append(region)
        return self.unique_regions_list
    
    def count_regions_dict(self):
        # One count per unique region; no inner loop over the rows is needed.
        return {key: self.region.count(key) for key in self.unique_regions()}
                
    def create_dictionary(self):
        self.dictionary = {}
        self.dictionary["Age"] = [int(age) for age in self.age]
        self.dictionary["Sex"] = self.sex
        self.dictionary["BMI"] = [float(bmi) for bmi in self.bmi]
        self.dictionary["Children"] = [int(child) for child in self.children]
        self.dictionary["Smoker"] = self.smoker
        self.dictionary["Region"] = self.region
        self.dictionary["Charges"] = [float(charge) for charge in self.charges]
        return self.dictionary
    
    def smoker_dictionary(self):
        self.zipper = list(zip(self.smoker, self.age, self.sex, self.bmi, self.children, self.region, self.charges))
        self.smoker_dict = {"Smokers":
                      {"Age": [row[1] for row in self.zipper if row[0] == "yes"],
                      "Sex": [row[2] for row in self.zipper if row[0] == "yes"],
                       "BMI": [row[3] for row in self.zipper if row[0] == "yes"],
                      "Children": [row[4] for row in self.zipper if row[0] == "yes"],
                       "Region": [row[5] for row in self.zipper if row[0] == "yes"],
                       "Charges": [row[6] for row in self.zipper if row[0] == "yes"]},
                      "Nonsmokers": 
                       {"Age": [row[1] for row in self.zipper if row[0] == "no"],
                      "Sex": [row[2] for row in self.zipper if row[0] == "no"],
                        "BMI": [row[3] for row in self.zipper if row[0] == "no"],
                      "Children": [row[4] for row in self.zipper if row[0] == "no"],
                       "Region": [row[5] for row in self.zipper if row[0] == "no"],
                       "Charges": [row[6] for row in self.zipper if row[0] == "no"]}}
        return self.smoker_dict
        
    def analyze_smoker(self):
        dict_smokers = self.smoker_dictionary()
        return {"Smokers": 
                {"Average age": self.get_average(dict_smokers["Smokers"]["Age"]),
                 # Convert ages to int so min/max compare numbers, not strings.
                 "Age range in data": "{} - {}".format(min(int(age) for age in dict_smokers["Smokers"]["Age"]), 
                                               max(int(age) for age in dict_smokers["Smokers"]["Age"])),
                 "Percentage male": dict_smokers["Smokers"]["Sex"].count("male") / 
                 len(dict_smokers["Smokers"]["Sex"]) * 100,
                "Average BMI": self.get_average(dict_smokers["Smokers"]["BMI"]),
                "Average number of children": self.get_average(dict_smokers["Smokers"]["Children"]),
                "Average charges": self.get_average(dict_smokers["Smokers"]["Charges"]),
                 },
               "Nonsmokers":
                {"Average age": self.get_average(dict_smokers["Nonsmokers"]["Age"]),
                 # Convert ages to int so min/max compare numbers, not strings.
                "Age range in data": "{} - {}".format(min(int(age) for age in dict_smokers["Nonsmokers"]["Age"]), 
                                               max(int(age) for age in dict_smokers["Nonsmokers"]["Age"])),
                 "Percentage male": dict_smokers["Nonsmokers"]["Sex"].count("male") / 
                 len(dict_smokers["Nonsmokers"]["Sex"]) * 100,
                "Average BMI": self.get_average(dict_smokers["Nonsmokers"]["BMI"]),
                "Average number of children": self.get_average(dict_smokers["Nonsmokers"]["Children"]),
                "Average charges": self.get_average(dict_smokers["Nonsmokers"]["Charges"])}
               }

    def analyze_regions_smokers(self):
        regions = self.unique_regions()
        dictionary = self.smoker_dictionary()
        return {region:
                {"Smokers": dictionary["Smokers"]["Region"].count(region),
                 "Nonsmokers": dictionary["Nonsmokers"]["Region"].count(region),
                "Percentage smokers": dictionary["Smokers"]["Region"].count(region) /
                 (dictionary["Smokers"]["Region"].count(region) + 
                  dictionary["Nonsmokers"]["Region"].count(region)) * 100} 
                for region in regions}
        
patient = Patient(list_age, list_sex, list_bmi, list_children, list_smoker, list_region, list_charges)
dict_data = patient.create_dictionary()

# Exploring the data

First, we look at the data to obtain information about the sample as a whole.

In [578]:
patient.analyze_sexes()

{'female': 662, 'male': 676}

There are 662 women and 676 men in the data set, so the sample has 1338 entries: 49.5% female and 50.5% male.
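The percentages quoted above follow directly from the counts. A small sketch of that arithmetic, where the helper name `percentages` is an assumption and not part of the ```Patient``` class:

```python
def percentages(counts):
    """Convert a {label: count} dictionary into {label: percentage}."""
    total = sum(counts.values())
    return {label: round(count / total * 100, 1) for label, count in counts.items()}

sex_counts = {"female": 662, "male": 676}  # output of analyze_sexes() above
print(sum(sex_counts.values()))  # 1338
print(percentages(sex_counts))   # {'female': 49.5, 'male': 50.5}
```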

In [585]:
round(patient.get_average(list_age), 1)

39.2

The average age of the people in the sample is 39.2 years. The median age of people in the U.S. is about [38 years](https://datacommons.org/place/country/USA?utm_medium=explore&mprop=age&popt=Person&hl=en), so on this measure the sample seems roughly representative. However, a mean and a median are not directly comparable, and further investigation would be necessary to be more certain.
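Because the comparison above sets a sample average against a population median, it may help to compute the sample median as well. The following is a minimal sketch without extra libraries; the function name `get_median` and the example ages are assumptions for illustration.

```python
def get_median(values):
    """Return the median of a list of numbers or numeric strings."""
    ordered = sorted(float(value) for value in values)
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        # Odd number of entries: the middle one is the median.
        return ordered[middle]
    # Even number of entries: average the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2

# Made-up ages, in the string form that csv.DictReader produces.
example_ages = ["19", "64", "23", "31", "45"]
print(get_median(example_ages))              # 31.0
print(get_median(["18", "20", "30", "40"]))  # 25.0
```

In the notebook, `get_median(list_age)` would give the sample median to set against the U.S. figure.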

In [584]:
with_children, without_children = patient.has_children()  # call the method once and unpack the tuple
"People with children: {}. People without children: {}".format(with_children, without_children)

'People with children: 764. People without children: 574'

There are 764 people in the sample who have at least one child and 574 who do not. That the majority have at least one child is plausible given the average age of the people in the data set. In a younger sample (e.g., high school students), one would expect more people without children.

In [587]:
round(patient.get_average(list_bmi), 1)

30.7

The average BMI in this data set is 30.7, which would be classified as [Obese, Class I](https://www.cdc.gov/obesity/adult/defining.html). It is possible, however, that this number is strongly influenced by outliers, so we should be careful not to interpret this figure entirely on its own. For comparison, according to data from a 2015 World Health Organization report, the average U.S. BMI was [28.5](https://en.wikipedia.org/wiki/List_of_countries_by_body_mass_index#cite_note-1). The number found here is somewhat higher than that average.
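One way to look past the average is to count how many people fall into each BMI category. The sketch below uses the standard adult cut-offs (18.5, 25 and 30) and does not split obesity into classes. The function names and example values are assumptions for illustration, not part of the ```Patient``` class.

```python
def bmi_category(bmi):
    """Classify a BMI value using the standard adult cut-offs."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Healthy weight"
    if bmi < 30:
        return "Overweight"
    return "Obese"

def count_bmi_categories(bmi_values):
    """Count how many entries fall into each BMI category."""
    counts = {}
    for value in bmi_values:
        category = bmi_category(float(value))
        counts[category] = counts.get(category, 0) + 1
    return counts

# Made-up BMI values, in the string form that csv.DictReader produces.
example_bmis = ["27.9", "33.77", "22.7", "17.3", "41.2"]
print(count_bmi_categories(example_bmis))
# {'Overweight': 1, 'Obese': 2, 'Healthy weight': 1, 'Underweight': 1}
```

Calling `count_bmi_categories(list_bmi)` in the notebook would show whether the high average reflects most of the sample or a smaller group with very high values.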

In [235]:
patient.count_regions_dict()

{'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}

The data contains four regions; entries are more or less evenly distributed across them, with slightly more in the southeast. 

This even split suggests that some areas are overrepresented relative to their population; for example, the northeast region of the U.S. is far more densely populated than the northwest, but that does not show in this data.

In [540]:
smoker = patient.smoker_dictionary()

print(len(smoker["Smokers"]["Age"]))
print(len(smoker["Nonsmokers"]["Age"]))

274
1064


There are 274 smokers in the sample and 1064 nonsmokers. 

Using the ```analyze_smoker()``` method, we can investigate differences between smokers and nonsmokers:

In [575]:
patient.analyze_smoker()

{'Smokers': {'Average age': 38.51459854014598,
  'Age range in data': '18 - 64',
  'Percentage male': 58.02919708029197,
  'Average BMI': 30.708448905109503,
  'Average number of children': 1.1131386861313868,
  'Average charges': 32050.23183153285},
 'Nonsmokers': {'Average age': 39.38533834586466,
  'Age range in data': '18 - 64',
  'Percentage male': 48.59022556390977,
  'Average BMI': 30.651795112781922,
  'Average number of children': 1.0902255639097744,
  'Average charges': 8434.268297856199}}

From the analysis above, we can conclude that smokers and nonsmokers in our sample are highly similar in most demographic characteristics. 


**Age**: The average ages of smokers and nonsmokers are highly similar, although nonsmokers in the sample tend to be slightly older, on average. The age ranges are the same, though it is possible that the distribution of ages is not exactly the same in the two groups.

**Sex**: Among the smokers, the percentage of men is higher (58.0%) than among the nonsmokers (48.6%). 


**BMI**: The average BMIs of smokers and nonsmokers in the sample are also nearly the same. 


**Children**: The average number of children is nearly the same for smokers (1.11) as for nonsmokers (1.09).


**Annual Charges**: The average charges of smokers and nonsmokers, however, are very different. Smokers' average annual charges are almost four times as high (about 32,050 vs 8,434). Without more information about the data source, it is difficult to say why. One possibility is that people who smoke encounter more medical issues than people who do not, and therefore pay higher insurance costs. To investigate this further, we could segment the data by age group and compare the charges of smokers and nonsmokers within each group. Since health issues associated with smoking tend to occur at later ages, we could check whether the gap between smokers and nonsmokers grows with age.
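The age-group comparison described above could be sketched as follows. The function name `average_charges_by_age_group`, the group boundaries, and the example rows are all assumptions chosen for illustration; other boundaries may suit the data better.

```python
def average_charges_by_age_group(ages, smokers, charges, bounds=(18, 30, 40, 50, 65)):
    """Average charges per age group, split into smokers and nonsmokers.

    bounds are hypothetical group edges: [18, 30), [30, 40), [40, 50), [50, 65).
    Ages outside these bounds are ignored.
    """
    totals = {}  # (group label, smoker flag) -> [sum of charges, count]
    for age, smoker, charge in zip(ages, smokers, charges):
        age = int(age)
        for low, high in zip(bounds, bounds[1:]):
            if low <= age < high:
                key = ("{}-{}".format(low, high - 1), smoker)
                entry = totals.setdefault(key, [0.0, 0])
                entry[0] += float(charge)
                entry[1] += 1
                break
    return {key: total / count for key, (total, count) in totals.items()}

# Made-up rows for illustration only.
ages = ["19", "25", "33", "35", "52", "58"]
smokers = ["yes", "no", "yes", "no", "yes", "no"]
charges = ["16000", "2000", "20000", "5000", "40000", "12000"]
result = average_charges_by_age_group(ages, smokers, charges)
print(result[("18-29", "yes")])  # 16000.0
print(result[("50-64", "no")])   # 12000.0
```

In the notebook, the call would be `average_charges_by_age_group(list_age, list_smoker, list_charges)`. Comparing the `"yes"` and `"no"` averages within each group would show whether the gap is present at every age or mainly in the older groups.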

In [577]:
patient.analyze_regions_smokers()

{'southwest': {'Smokers': 58,
  'Nonsmokers': 267,
  'Percentage smokers': 17.846153846153847},
 'southeast': {'Smokers': 91, 'Nonsmokers': 273, 'Percentage smokers': 25.0},
 'northwest': {'Smokers': 58,
  'Nonsmokers': 267,
  'Percentage smokers': 17.846153846153847},
 'northeast': {'Smokers': 67,
  'Nonsmokers': 257,
  'Percentage smokers': 20.679012345679013}}

From the data above, we can conclude that smoking is most common in the southeast region of the U.S. (25.0%), followed by the northeast (20.7%). Smoking is less common in both western regions (17.8% each).
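To make the ordering of the regions explicit, the percentages can be sorted from highest to lowest. This short sketch copies the percentages from the output above into a plain dictionary; the variable names are assumptions.

```python
# Output of analyze_regions_smokers() above, reduced to the percentages.
percentage_smokers = {
    "southwest": 17.846153846153847,
    "southeast": 25.0,
    "northwest": 17.846153846153847,
    "northeast": 20.679012345679013,
}

# Sort regions from highest to lowest share of smokers.
ranking = sorted(percentage_smokers.items(), key=lambda item: item[1], reverse=True)
for region, percentage in ranking:
    print("{}: {:.1f}%".format(region, percentage))
# southeast: 25.0%
# northeast: 20.7%
# southwest: 17.8%
# northwest: 17.8%
```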

# Conclusions

Regarding our questions, from this data, we can conclude the following:

- *Do smokers differ from nonsmokers (age, age range, BMI, number of children, men/women)?*

We can conclude that smokers and nonsmokers (in this sample, at least) are similar in terms of age, age range, average BMI, and average number of children. Amongst smokers, the percentage of men is somewhat higher than amongst nonsmokers. 



- *What is the effect of smoking on insurance costs?*

The average annual insurance charges of smokers are almost four times as high as those of nonsmokers (about 3.8 times). 



- *Are there regional differences in smoking incidence?*

Amongst people living in the southeast region of the United States, smoking is relatively common (25%), compared to the northeast (20.7%), northwest (17.8%) and the southwest (17.8%) regions of the country.

# Suggestions

Based on what we learned from the current data, we might be interested in the following further exploration of this data:

- Build a model to predict insurance cost, and based on this model, find recommendations for individuals to lower their annual insurance cost.
- Investigate the interaction of different demographic variables. For example, does age play a role in gender differences regarding smoking? In other words, do gender differences in smoking change as people get older?
- Analyze insurance charges for smokers vs nonsmokers further. In particular, find out how the discrepancy between smokers and nonsmokers develops over time. Do smokers always have a higher insurance cost, or does this perhaps co-occur with medical issues that develop over time as a result of smoking?
- The current data only allows us to compare groups cross-sectionally (that is, look at differences between people at a fixed point in time). It would be very interesting to track individuals, their demographic characteristics, and their insurance charges over time. Using data from one point in time, we might try to predict, for example, whether someone will start smoking at a later point in time.