### Big Data CS696 - Assignment 2 - Somnath Shantveer (RedId - 823379096)

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import scipy.stats as st

vehicle_data_file = 'assignment2Data/vehicles.csv'
causes_of_death_file = 'assignment2Data/causes_of_death.csv'
framingham_file = 'assignment2Data/framingham.csv'

### Issue 1 : Fuel economy analysis
- Analysing the fuel economy of vehicles sold in the US from 2000 to 2019
- Vehicle companies considered in the analysis: General Motors, Ford, Honda, Toyota, Chrysler

In [None]:
vehicle_data = pd.read_csv(vehicle_data_file, 
                           usecols=['make', 'model', 'year', 'comb08', 'atvType', 'fuelType1'])

# Filter vehicle data based on year.
vehicle_data = vehicle_data[vehicle_data.eval('year >= 2000 & year <= 2019')]

#Filter vehicle data based on make.
vehicle_make_list = ['Honda', 'Acura', 
                        'Toyota', 'Lexus', 'Scion', 
                        'GMC', 'Buick', 'Cadillac', 'Chevrolet', 
                        'Ford', 'Lincoln', 
                        'Chrysler', 'Dodge', 'Jeep', 'Ram']
vehicle_data = vehicle_data[((vehicle_data['make']).isin(vehicle_make_list))]

#Combine vehicle makes based on parent company. Replace 'make' with parent company name
make_to_parent = {
    'GMC': 'General Motors', 'Buick': 'General Motors',
    'Cadillac': 'General Motors', 'Chevrolet': 'General Motors',
    'Lexus': 'Toyota', 'Scion': 'Toyota',
    'Lincoln': 'Ford',
    'Dodge': 'Chrysler', 'Jeep': 'Chrysler', 'Ram': 'Chrysler',
    'Acura': 'Honda',
}
vehicle_data['make'] = vehicle_data['make'].replace(make_to_parent)

### 1.1 
For each company collect the MPG sold by each company in the years 2000-2019. Produce
the box plots per company for the MPG over those years. How do the companies
compare?

In [None]:
# Filter vehicle data based on fuel type. Only gasoline vehicles (hybrids are considered)
vehicle_data_gasoline = vehicle_data[(vehicle_data['fuelType1'].str.contains('Gasoline'))]

plt.figure(figsize=(20, 8))
sns.boxplot(data=vehicle_data_gasoline, x='make', y ='comb08')

In [None]:
# Filter vehicle data based on fuel type. Only Gasoline vehicles (Hybrids are not considered). No alternate fuel
vehicle_data_gasoline_no_hybrid = vehicle_data_gasoline[vehicle_data_gasoline['atvType'].isnull()]

plt.figure(figsize=(20, 8))
sns.boxplot(data=vehicle_data_gasoline_no_hybrid, x='make', y ='comb08')

### Analysis
Box plot for vehicles using gasoline (hybrids are considered)
- Honda vehicles have a better fuel economy distribution than the other makes over the given years.
- There are a lot of outliers for Toyota, which could be due to hybrid vehicles whose fuel economy (miles per gallon) is better.

Box plot for vehicles using gasoline (hybrids are not considered)
- Honda vehicles have a better fuel economy distribution than the other makes over the given years.
- Overall fuel economy is lower when we remove the hybrid vehicles from our analysis.
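
The comparison read off the box plots can be cross-checked numerically. The sketch below computes the quartiles of combined MPG per make on a small made-up table. The `toy` DataFrame and its values are illustrative assumptions, not the vehicle data; only the column names `make` and `comb08` come from the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for vehicle_data_gasoline (values are made up).
toy = pd.DataFrame({
    'make': ['Honda', 'Honda', 'Honda', 'Toyota', 'Toyota', 'Toyota'],
    'comb08': [24, 28, 32, 20, 26, 50],
})

# Quartiles per make: the numbers a box plot draws as the box edges and the median line.
mpg_quartiles = (toy.groupby('make')['comb08']
                    .quantile([0.25, 0.5, 0.75])
                    .unstack())
print(mpg_quartiles)
```

Replacing `toy` with `vehicle_data_gasoline` or `vehicle_data_gasoline_no_hybrid` should give the corresponding table for each of the two plots.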

### 1.2
Plot the yearly mean in the years 2000- 2019 with confidence interval of the mpg for each company. 
That is for each company compute the mean mpg over all vehicles sold by that company per year. 
What changes have there been in those years? How do the companies compare?

In [None]:
# Miles per gallon mean grouped by make per year. (Including hybrid vehicles)
# Select 'comb08' before averaging so that text columns are not passed to mean().
mpg_mean = vehicle_data_gasoline.groupby(['make', 'year'], as_index=False)['comb08'].mean()

sns.lmplot(x='year', y='comb08', data=mpg_mean, height=8, aspect=1.5, hue='make')

In [None]:
# Miles per gallon mean grouped by make per year. (Not including hybrid vehicles)
# as_index=False already keeps 'make' and 'year' as columns, so no reset_index() is needed.
mpg_mean_no_hybrid = vehicle_data_gasoline_no_hybrid.groupby(['make', 'year'], as_index=False)['comb08'].mean()

sns.lmplot(x='year', y='comb08', data=mpg_mean_no_hybrid, height=8, aspect=1.5, hue='make')

### Analysis
- The average fuel efficiency has increased for all makes from 2000 to 2019.
- Honda and Toyota vehicles have better mean fuel efficiency during these years.
- GM vehicles have lower fuel efficiency than the other makes, possibly because of their larger vehicles.
- There was a drop in the Honda and Toyota mean values from 2006 to 2013.
- When comparing the mean fuel efficiency of vehicles with and without hybrids, some makes (e.g. Toyota)
show a different trend line. This could be due to efficient hybrids such as the Prius.
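
The band drawn by `lmplot` is the confidence band of the fitted regression line. As a complementary sketch, the yearly mean and a 95% t-interval around it can be computed directly per make and year. The `toy` data below is made up for illustration; only the column names match the notebook.

```python
import pandas as pd
import scipy.stats as st

# Hypothetical toy data; the values are assumptions, not the vehicle data.
toy = pd.DataFrame({
    'make': ['Honda'] * 4 + ['Toyota'] * 4,
    'year': [2000, 2000, 2001, 2001] * 2,
    'comb08': [24, 26, 27, 29, 22, 26, 25, 31],
})

# Mean, standard error and group size for every (make, year) pair.
yearly = (toy.groupby(['make', 'year'])['comb08']
             .agg(['mean', 'sem', 'count'])
             .reset_index())

# Critical t value per group, then the bounds of the 95% interval around each mean.
t_crit = st.t.ppf(0.975, yearly['count'] - 1)
yearly['ci_low'] = yearly['mean'] - t_crit * yearly['sem']
yearly['ci_high'] = yearly['mean'] + t_crit * yearly['sem']
print(yearly)
```

Alternatively, seaborn's `lineplot` aggregates repeated `x` values and draws a confidence band around the mean by default, so passing it the unaggregated vehicle rows is another option.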

### 1.3
Plot the mpg for each company per year of their most fuel efficient vehicle each year. What
changes have there been in those years? How do the companies compare?

In [None]:
# mpg grouped by make per year and get max value. (Including hybrid vehicles)
mpg_max_yearly = vehicle_data_gasoline.groupby(['make', 'year'], as_index=False).max().reset_index()
sns.lmplot(x='year', y='comb08', data=mpg_max_yearly, height=10, aspect=1.5, hue='make')

# mpg grouped by make per year and get max value. (Not including hybrid vehicles)
mpg_max_yearly = vehicle_data_gasoline_no_hybrid.groupby(['make', 'year'], as_index=False).max().reset_index()
sns.lmplot(x='year', y='comb08', data=mpg_max_yearly, height=10, aspect=1.5, hue='make')

### Analysis
- The most efficient vehicle from each company shows improved mpg over the years, except for Honda.
- Honda had a dip in mpg from 2006 to 2011 for its most efficient vehicle.
- Toyota has had the most efficient vehicle since 2006.
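
Note that `groupby(...).max()` takes the maximum of every column independently, so the `model` in each row is not necessarily the model with the highest `comb08`. A minimal sketch of recovering the whole row of the most efficient vehicle uses `idxmax`. The model names and values below are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; model names and MPG values are made up.
toy = pd.DataFrame({
    'make': ['Honda', 'Honda', 'Toyota', 'Toyota'],
    'model': ['H-small', 'H-hybrid', 'T-small', 'T-hybrid'],
    'year': [2010, 2010, 2010, 2010],
    'comb08': [29, 41, 30, 50],
})

# Row label of the highest-MPG vehicle within each (make, year) group.
best_rows = toy.groupby(['make', 'year'])['comb08'].idxmax()

# Select whole rows so that 'model' and 'comb08' come from the same vehicle.
most_efficient = toy.loc[best_rows].reset_index(drop=True)
print(most_efficient)
```

The resulting table can be passed to the same plotting call, and it also shows which model is responsible for each yearly maximum.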

### Issue 2 - Diet and Death

### Causes of death. 
Plot the death rate for each disease over time from the data set causes_of_death.csv.

In [None]:
causes_of_death = pd.read_csv(causes_of_death_file)
sns.lmplot(x='Year', y='Age Adjusted Death Rate', data=causes_of_death, height=8, aspect=1.5, hue='Cause')

## Diabetes and the population. 
The data set in framingham.csv contains information from the Framingham Heart Study of 5,209 adults.
First, we check whether the sample of people in the study is representative of the general population.
We will use diabetes to test this. 
The CDC indicates that prevalence (percent) of diabetes was 0.93% at the time of the study. 
Our hypothesis:

- Null Hypothesis: The probability that a participant within the Framingham Study has diabetes is
equivalent to the prevalence of diagnosed diabetes within the population. (i.e., any difference
is due to chance).
- Alternative Hypothesis: The probability that a participant within the Framingham Study has diabetes
is different from the prevalence of diagnosed diabetes within the population.

In the framingham.csv file the column DIABETES contains 1 for people with diabetes and 0 for
those without.

### 4. What is the percentage of people in the study that have diabetes?

In [None]:
# Calculate percentage of people having diabetes in this study.
farmingham_study_data = pd.read_csv(framingham_file)
diabetes_pct = farmingham_study_data['DIABETES'].value_counts(normalize=True)

print("Percentage of people in the study that have diabetes = ", (diabetes_pct[1]*100).round(3))

Now we need to compare this to the general population. Either a person is diagnosed as having
diabetes or not. We can use the multinomial distribution to generate a sample of two values.
Say we have an event that has .75 probability of occurring. Then the following will count
the number of times the event does not occur and occur in a sample of 1000.

two_value_probabilities = [0.25, 0.75]
sample_size = 1000
np.random.multinomial(sample_size, two_value_probabilities)

Using this we can compute the number of people we would expect to have diabetes in a sample of 5,000, which we need to convert to a percentage. Now do this 200 times.

In [None]:
# Using the proportion of people with diabetes in the study, generate 200 samples of 5,000 people each.
diabetes_probabilities = [diabetes_pct[0], diabetes_pct[1]]
sample_size = 5000

general_population_diabetes = pd.DataFrame(np.random.multinomial(sample_size, diabetes_probabilities, 200))
general_population_diabetes.columns = ['non-diabetic', 'diabetic']

# Convert to percentage
general_population_diabetes = (100.*general_population_diabetes/sample_size).round(3)

### 5. Produce the histogram of the percent of people in your 200 samples with diabetes.

In [None]:
#Plot histogram of the diabetic percentage from the general population findings
general_population_diabetes.hist(column='diabetic', bins=25, grid=False, figsize=(12,8))

### 6. Compute the 95% confidence interval of the 200 values in 5

In [None]:
# returns confidence interval of mean
diabetic_confidence_interval = st.t.interval(0.95, len(general_population_diabetes["diabetic"])-1, 
              loc=np.mean(general_population_diabetes["diabetic"]), 
              scale=st.sem(general_population_diabetes["diabetic"]))
print(diabetic_confidence_interval)

### 7. Is the study representative of the general population? Why or why not?

- Yes, the study is representative of the general population.
- The percentage of people in the study with diabetes is 2.733%.
- For the 200 simulated samples, the 95% confidence interval of the mean diabetic percentage is approximately (2.703, 2.770).
- The study's diabetic percentage lies within this 95% confidence interval, and hence the study is taken to represent the population (at the 95% confidence level).
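
A complementary way to frame the comparison is to simulate directly under the null hypothesis, using the CDC prevalence of 0.93% quoted earlier as the probability of diabetes. The sketch below is one possible setup, not the method used above. It swaps the two-category multinomial for the equivalent binomial draw, and the seed and the helper name `within_interval` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

population_prevalence = 0.0093  # CDC figure quoted in the text (0.93%)
sample_size = 5000
n_samples = 200

# Each draw is the number of diabetics in one simulated sample of 5,000 people.
diabetic_counts = rng.binomial(sample_size, population_prevalence, size=n_samples)
diabetic_pct = 100.0 * diabetic_counts / sample_size

# Central 95% range of the 200 simulated percentages.
low, high = np.percentile(diabetic_pct, [2.5, 97.5])

def within_interval(observed_pct, low, high):
    """Return True if an observed percentage lies inside [low, high]."""
    return low <= observed_pct <= high

print(low, high)
```

`within_interval` can then be applied to the percentage observed in the study to see whether it is consistent with samples drawn at the population prevalence.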

### 8. Plot the cholesterol values
In the file framingham.csv the column TOTCHOL gives the total cholesterol of each person in
the study. The column ANYCHD indicates if the person has any heart disease.

Plot the cholesterol values for the people with heart disease and for the people without heart
disease.

In [None]:
farmingham_study_data = pd.read_csv(framingham_file)

# Plotting cholesterol data grouped by heart disease status (ANYCHD).
plt.figure(figsize=(10, 8))
sns.boxplot(data=farmingham_study_data, x='ANYCHD', y ='TOTCHOL')

# violinplot draws on the current axes and has no 'height' argument; figsize sets the size.
plt.figure(figsize=(10, 8))
sns.violinplot(x='ANYCHD', y='TOTCHOL', data=farmingham_study_data)

# jointplot creates its own figure, so no plt.figure() call is needed.
sns.jointplot(x='ANYCHD', y='TOTCHOL', data=farmingham_study_data, height=8)

sns.lmplot(y='TOTCHOL', x='AGE', data=farmingham_study_data, height=10, aspect=1, hue='ANYCHD')

### 9. Compute the 95% confidence interval.
Compute the 95% confidence interval of the cholesterol values for the people with heart disease and for the people without heart disease.



In [None]:
# Let's create two groups: people with heart disease and people without heart disease
data_with_heart_disease = farmingham_study_data[farmingham_study_data['ANYCHD'] == 1]
data_without_heart_disease = farmingham_study_data[farmingham_study_data['ANYCHD'] == 0]

chol_confidence_interval_with_heart_disease = st.t.interval(0.95, len(data_with_heart_disease["TOTCHOL"])-1, 
              loc=np.mean(data_with_heart_disease["TOTCHOL"]), 
              scale=st.sem(data_with_heart_disease["TOTCHOL"]))
print('95% Confidence interval of the cholesterol for people with heart disease',
      chol_confidence_interval_with_heart_disease)

chol_confidence_interval_without_heart_disease = st.t.interval(0.95, len(data_without_heart_disease["TOTCHOL"])-1, 
              loc=np.mean(data_without_heart_disease["TOTCHOL"]), 
              scale=st.sem(data_without_heart_disease["TOTCHOL"]))
print('95% Confidence interval of the cholesterol for people without heart disease',
      chol_confidence_interval_without_heart_disease)


### 10. What can we deduce about cholesterol values and heart disease?
Answer:
From the plots and the calculated confidence intervals,
- People with heart disease tend to have higher cholesterol levels.
  (or) People with high cholesterol levels tend to get heart disease.
- The 95% confidence interval for the mean cholesterol level of people with heart disease is 246.5 to 252.4.
- The 95% confidence interval for the mean cholesterol level of people without heart disease is 231.3 to 234.4.
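
An interval computed from the standard error describes the uncertainty in the group mean, which is different from the range that contains 95% of the individuals. The sketch below contrasts the two on synthetic data. The mean of 240 and spread of 45 are assumptions chosen only for illustration and are not taken from the Framingham file.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility only

# Synthetic cholesterol-like values (assumed mean and spread, illustration only).
values = rng.normal(loc=240, scale=45, size=2000)

# 95% confidence interval for the mean: narrow, and it shrinks as the sample grows.
ci_mean = st.t.interval(0.95, len(values) - 1,
                        loc=np.mean(values), scale=st.sem(values))

# Central 95% range of the individual values: wide, reflects person-to-person spread.
central_range = np.percentile(values, [2.5, 97.5])

print(ci_mean, central_range)
```

With the real data, the same two quantities can be computed per group from the `TOTCHOL` column. Missing values, if any, would need to be dropped first, since `st.sem` propagates NaN by default.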