# Kaggle Data Science Project: US President Height Analysis

By Srushti Shimpi

**This project is from Kaggle. You can find it [HERE](https://www.kaggle.com/srushtishimpi/us-president-height-analysis).**

## Importing Libraries

In [92]:
#Importing the libraries
import numpy as np
import pandas as pd
import plotly as plt
import plotly.express as px

## Reading the data

In [93]:
#Reading data
data = pd.read_csv("president_heights.csv")
print(data.columns)
print(data.head())
print(data.tail())

Index(['order', 'name', 'height(cm)'], dtype='object')
   order               name  height(cm)
0      1  George Washington         189
1      2         John Adams         170
2      3   Thomas Jefferson         189
3      4      James Madison         163
4      5       James Monroe         183
    order               name  height(cm)
37     40      Ronald Reagan         185
38     41  George H. W. Bush         188
39     42       Bill Clinton         188
40     43     George W. Bush         182
41     44       Barack Obama         185


## Issues in Data

1. The column height(cm) needs a simpler name.
2. The order values do not match the index (the last row has index 41 but order 44), so we need to check for missing data.

In [94]:
#1. Renaming the column height(cm) with height using rename() function
data = data.rename({"height(cm)":"height"}, axis="columns")
data.columns

Index(['order', 'name', 'height'], dtype='object')

In [95]:
#2. Finding any missing orders: print each row whose next order is not order + 1
order = data["order"]
for index, numOrder in enumerate(order[:-1]):
    if numOrder + 1 != order[index + 1]:
        print(data.loc[index])

order                    21
name      Chester A. Arthur
height                  183
Name: 20, dtype: object
order                    23
name      Benjamin Harrison
height                  168
Name: 21, dtype: object
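
Another way to spot the gaps is to compare the order column against the full range of numbers it should contain. The sketch below is a minimal illustration on a small stand-in series (the values are chosen for the example and are not the full dataset); the names `expected` and `missing` are assumptions made here.

```python
import pandas as pd

# Stand-in for the 'order' column: 22 and 24 are deliberately left out
order = pd.Series([20, 21, 23, 25, 26], name="order")

# Every order number expected between the first and last value
expected = set(range(int(order.min()), int(order.max()) + 1))

# Orders that never appear in the column
missing = sorted(expected - set(order.tolist()))
print(missing)
```

This reports the missing order numbers themselves rather than the rows that precede each gap.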

Counting every US president to date, four entries are missing. Grover Cleveland served as both the 22nd and the 24th president of the US, Donald Trump was the 45th POTUS, and the current 46th POTUS is Joseph Biden.

Let's check whether Grover Cleveland, Donald Trump and Joseph Biden are listed somewhere else in the data.

In [96]:
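#Looking for the following presidents in the given data
#Note: ("A" or "B" or "C") evaluates to "A" only, so each name is tested separately with any()
missing_names = ("Grover Cleveland", "Donald Trump", "Joseph Biden")
[x for x in data.name if any(name in x for name in missing_names)]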


[]

This confirms that Grover Cleveland, Donald Trump and Joseph Biden are not in the given data.
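
When several names are checked at once, it helps to remember that Python's `or` between non-empty strings simply returns the first string, so a membership test built that way looks at one name only. The sketch below shows this on made-up stand-in data and uses pandas' `isin()` as one possible alternative; the variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up stand-in for the 'name' column
names = pd.Series(["Barack Obama", "Donald Trump", "Bill Clinton"])
wanted = ["Grover Cleveland", "Donald Trump", "Joseph Biden"]

# `or` returns the first truthy operand, so only this one name would be tested
first_only = ("Grover Cleveland" or "Donald Trump" or "Joseph Biden")
print(first_only)

# isin() compares every entry against the whole list of wanted names
found = names[names.isin(wanted)].tolist()
print(found)
```

`isin()` matches whole strings exactly, so it suits full names; for partial matches a substring test per name would be needed instead.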

We can add their heights to the data. Since all three of them were presidents of the US, it is important to include their data.
According to Google, Grover Cleveland is 180cm tall, Donald Trump is 190cm tall and Joseph Biden is 182cm tall.

In [98]:
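#Adding Grover Cleveland's data at 22nd and 24th order
#The fractional index (20.5, 21.5) lets sort_index() place each new row between its neighbours later
#Run this cell only once: running it again inserts the same rows a second time
addedRow = 0
for x in [22,24]:
  
  insdata = pd.DataFrame({"order": x , "name": "Grover Cleveland", "height":180}, index=[x - addedRow - 1.5])
  data = pd.concat([data.iloc[:x-2], insdata, data.iloc[x-2:]])
  addedRow = addedRow + 1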



In [99]:
#Adding Donald Trump's data at 45th order

addedRow = 0
for x in [45]:
  
  insdata = pd.DataFrame({"order": x , "name": "Donald Trump", "height":190}, index=[x - addedRow - 1.5])
  data = pd.concat([data.iloc[:x-2], insdata, data.iloc[x-2:]])

In [100]:
#Adding Joseph Biden's data at 46th order

addedRow = 0
for x in [46]:
  
  insdata = pd.DataFrame({"order": x , "name": "Joseph Biden", "height":182}, index=[x - addedRow - 1.5])
  data = pd.concat([data.iloc[:x-2], insdata, data.iloc[x-2:]])
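
The three insertion cells above rely on fractional index values. A simpler alternative, shown here only as a sketch and not as the notebook's method, is to append the new rows and sort by the order column. The frame `data_demo` and the name `new_rows` are stand-ins introduced for this example; the values are the ones quoted in the text.

```python
import pandas as pd

# Stand-in frame with a gap at order 22
data_demo = pd.DataFrame({
    "order": [21, 23],
    "name": ["Chester A. Arthur", "Benjamin Harrison"],
    "height": [183, 168],
})

# Row to be added
new_rows = pd.DataFrame({
    "order": [22],
    "name": ["Grover Cleveland"],
    "height": [180],
})

# Append the new row, then let the 'order' column decide the position
data_demo = (
    pd.concat([data_demo, new_rows], ignore_index=True)
    .sort_values("order")
    .reset_index(drop=True)
)
print(data_demo)
```

Sorting on the order column avoids computing index offsets by hand, at the cost of re-sorting the whole frame.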

In [101]:
#Resetting the data after adding 4 rows
data = data.sort_index().reset_index(drop=True)

#Shows Grover Cleveland's data in the given data
data[20:25]

    order               name  height
20     21  Chester A. Arthur     183
21     22   Grover Cleveland     180
22     22   Grover Cleveland     180
23     23  Benjamin Harrison     168
24     24   Grover Cleveland     180


In [102]:
#Shows rows 41 to 45 (Ronald Reagan to Barack Obama); Donald Trump and Joseph Biden sit just past this slice, at indices 46 and 47
data[41:46]

    order               name  height
41     40      Ronald Reagan     185
42     41  George H. W. Bush     188
43     42       Bill Clinton     188
44     43     George W. Bush     182
45     44       Barack Obama     185


Our issues are now resolved.
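
One thing worth double-checking at this point: the `data[20:25]` output above lists order 22 twice, which is what happens when an insertion cell is run more than once. The sketch below shows, on a small stand-in frame, how such exact repeats could be counted and dropped; `demo`, `repeats` and `cleaned` are names assumed for this example.

```python
import pandas as pd

# Stand-in frame containing one exact repeat of the order-22 row
demo = pd.DataFrame({
    "order": [21, 22, 22, 23],
    "name": ["Chester A. Arthur", "Grover Cleveland",
             "Grover Cleveland", "Benjamin Harrison"],
    "height": [183, 180, 180, 168],
})

# True for every row that repeats an earlier row in all columns
repeats = demo.duplicated()
print(int(repeats.sum()))

# Keep the first copy of each row and renumber the index
cleaned = demo.drop_duplicates().reset_index(drop=True)
print(cleaned)
```

Because `duplicated()` compares all columns by default, the legitimate order-24 entry for Grover Cleveland would not be flagged as a repeat of the order-22 entry.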

## Exploratory Data Analysis


In [103]:
#Data description
print(data.describe())

           order      height
count  48.000000   48.000000
mean   23.479167  180.020833
std    13.135933    6.724011
min     1.000000  163.000000
25%    12.750000  175.000000
50%    23.500000  181.000000
75%    34.250000  183.000000
max    46.000000  193.000000


In [104]:
#Data information
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 3 columns):
order     48 non-null int64
name      48 non-null object
height    48 non-null int64
dtypes: int64(2), object(1)
memory usage: 1.2+ KB
None


## Observations

In [105]:
#Sorting data in ascending order w.r.t. height and printing the 5 shortest presidents followed by the 5 tallest presidents
print(data.sort_values("height").head(5))
print(data.sort_values("height").tail(5))

    order               name  height
3       4      James Madison     163
23     23  Benjamin Harrison     168
7       8   Martin Van Buren     168
1       2         John Adams     170
26     25   William McKinley     170
    order               name  height
2       3   Thomas Jefferson     189
0       1  George Washington     189
46     45       Donald Trump     190
37     36  Lyndon B. Johnson     193
15     16    Abraham Lincoln     193


In [106]:
#Printing tallest and shortest presidents
print("The tallest US presidents:\n", data.loc[data["height"] == data["height"].max()])
print("\n\nThe shortest US presidents:\n", data.loc[data["height"] == data["height"].min()])

The tallest US presidents:
     order               name  height
15     16    Abraham Lincoln     193
37     36  Lyndon B. Johnson     193


The shortest US presidents:
    order           name  height
3      4  James Madison     163


## Summary of Data

* Total data entries: 48 rows (as reported by describe())
* Maximum height: 193cm (Abraham Lincoln, Lyndon B. Johnson)
* Minimum height: 163cm (James Madison)
* Mean height: 180.020833
* 1st quartile(25%), Q1: 175.000000
* Median(50%), Q2: 181.000000
* 3rd quartile(75%), Q3: 183.000000
* Standard deviation: 6.724011
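
Each figure in a summary like this can be computed directly from the height column. The following is a minimal sketch using only the first five heights shown in the `head()` output, so its numbers differ from the full-data summary; the dictionary `summary` is a name assumed for the example.

```python
import pandas as pd

# First five heights from the head() output (not the full dataset)
heights = pd.Series([189, 170, 189, 163, 183], name="height")

summary = {
    "count": int(heights.count()),
    "max": int(heights.max()),
    "min": int(heights.min()),
    "mean": float(heights.mean()),
    "q1": float(heights.quantile(0.25)),
    "median": float(heights.median()),
    "q3": float(heights.quantile(0.75)),
    "std": float(heights.std()),  # sample standard deviation (ddof=1), as in describe()
}
print(summary)
```

Computing the values in code rather than copying them by hand keeps the summary in step with whatever rows the frame currently holds.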

## Visualization

While studying and analyzing the heights of US presidents, we aim to keep each president only once, i.e. no duplicate presidents.

**Note:**

data = data for all presidencies, including duplicate presidents

newData = data with no duplicate presidents

In [108]:
#Creating new data with no duplicate presidents
newData = data.drop_duplicates(subset='name', keep='first')
newData.head()

   order               name  height
0      1  George Washington     189
1      2         John Adams     170
2      3   Thomas Jefferson     189
3      4      James Madison     163
4      5       James Monroe     183


### Scatter Plot

In [113]:
fig = px.scatter(newData, x="order", y="height", trendline="ols")
fig.layout.yaxis.title.text="Height in cm"
fig.layout.xaxis.title.text="Order of Presidents"
fig.show()

Distribution of heights of US presidents:

* 14 presidents fall in the 180cm-185cm height range, more than in any other range.
* Only 2 presidents are taller than 190cm.
* Only 1 president is shorter than 165cm.

The trendline in the scatter plot is an ordinary least squares (OLS) fit. Its equation and R-squared value, reported in the fit summary below (slope of about 0.2cm per presidency, R-squared of 0.152), indicate a slight upward trend in the heights of US presidents.
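
To make the slope and R-squared of a trendline concrete, here is a minimal sketch of a least-squares line computed by hand with NumPy. The five data points are made up for illustration and are not taken from the dataset; plotly itself hands the fit to statsmodels, so this is a stand-in for the idea rather than the library's code path.

```python
import numpy as np

# Made-up illustration data: presidency order and height in cm
order = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
height = np.array([176.0, 175.0, 177.0, 176.0, 178.0])

# A degree-1 least-squares fit returns (slope, intercept)
slope, intercept = np.polyfit(order, height, 1)

# R-squared = 1 - (residual sum of squares / total sum of squares)
predicted = intercept + slope * order
ss_res = np.sum((height - predicted) ** 2)
ss_tot = np.sum((height - height.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"height = {intercept:.2f} + {slope:.2f} * order, R-squared = {r_squared:.3f}")
```

A positive slope means height tends to rise with order, while a low R-squared means the line explains only a small share of the variation.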

In [114]:
#Result and summary of trendline of the scatter plot
results = px.get_trendline_results(fig)
print(results)

results.px_fit_results.iloc[0].summary()

                                      px_fit_results
0  <statsmodels.regression.linear_model.Regressio...


Dep. Variable:                    y   R-squared:               0.152
Model:                          OLS   Adj. R-squared:          0.132
Method:               Least Squares   F-statistic:               7.7
Date:              Sat, 05 Mar 2022   Prob (F-statistic):    0.00814
Time:                      04:31:35   Log-Likelihood:        -146.88
No. Observations:                45   AIC:                     297.8
Df Residuals:                    43   BIC:                     301.4
Df Model:                         1
Covariance Type:          nonrobust

            coef  std err        t   P>|t|   [0.025   0.975]
const   175.3359    1.945   90.138   0.000  171.413  179.259
x1        0.1995    0.072    2.775   0.008    0.055    0.345

Omnibus:                      0.805   Durbin-Watson:           2.285
Prob(Omnibus):                0.669   Jarque-Bera (JB):        0.651
Skew:                         0.287   Prob(JB):                0.722
Kurtosis:                     2.864   Cond. No.                 54.6


### Line Graph

In [115]:
fig = px.line(newData, x="order", y="height")
fig.layout.yaxis.title.text="Height in cm"
fig.layout.xaxis.title.text="Order of Presidents"
fig.show()

This line graph shows no steady rise or fall in height from one president to the next, so the heights are non-monotonic with respect to order.
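
The same conclusion can be checked numerically instead of by eye. A minimal sketch, using only the first five heights from the `head()` output as a stand-in for the full column:

```python
import pandas as pd

# First five heights from the head() output (not the full dataset)
heights = pd.Series([189, 170, 189, 163, 183])

# Both properties are False when the values go up and down
increasing = heights.is_monotonic_increasing
decreasing = heights.is_monotonic_decreasing
print(increasing, decreasing)
```

If both flags are False, the series neither only rises nor only falls, which is what non-monotonic means here.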

### Line Graph

In [116]:
fig = px.line(newData, x='order', y='height', color='name', markers=True)
fig.layout.yaxis.title.text="Height in cm"
fig.layout.xaxis.title.text="Order of Presidents"
fig.show()

According to this graph, the order range 31-40 contains the highest number of presidents.

### Bar Graph

In [117]:
fig =  px.bar( newData, x="order", y="height", orientation='h',color ='name')
fig.layout.yaxis.title.text="Height in cm"
fig.layout.xaxis.title.text="Order of Presidents"
fig.show()

This bar graph shows which presidents share the same height. The most common height is 183cm, shared by 8 presidents.
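
The most common height can also be read off without a chart by counting how often each value occurs. The sketch below uses a made-up list of heights, so the counts are illustrative only; `most_common_height` and `shared_by` are names assumed for the example.

```python
import pandas as pd

# Made-up heights for illustration (not the dataset)
heights = pd.Series([183, 170, 183, 163, 183, 170])

# Number of occurrences of each distinct height
counts = heights.value_counts()

most_common_height = int(counts.idxmax())  # height that occurs most often
shared_by = int(counts.max())              # how many entries have that height
print(most_common_height, shared_by)
```

Applied to the height column of the de-duplicated data, the same two lines would give the height and count quoted above.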

### Histogram

In [118]:
fig = px.histogram(newData, x="height", color='name')
fig.layout.yaxis.title.text="Number of Presidents"
fig.layout.xaxis.title.text="Height in cm"

fig.show()

Histogram Summary:
* There are 14 presidents in the height range 180cm to 185cm.
* There is 1 president in the height range 160cm to 165cm.
* The ranges 170cm to 175cm, 175cm to 180cm and 185cm to 190cm contain 8 presidents each.
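
Counts per height range like those in the summary can be tabulated numerically. This is a minimal sketch assuming 5cm-wide bins starting at 160cm, with made-up heights for illustration; the bin edges plotly picks automatically may differ.

```python
import numpy as np

# Made-up heights for illustration (not the dataset)
heights = [163, 170, 173, 180, 182, 183, 189]

# Bin edges 160, 165, ..., 195; each bin includes its lower edge
edges = np.arange(160, 200, 5)
counts, _ = np.histogram(heights, bins=edges)

for low, high, n in zip(edges[:-1], edges[1:], counts):
    print(f"{low}cm to {high}cm: {n}")
```

Note that `np.histogram` also includes the upper edge in the last bin, which only matters for a value of exactly 195 here.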

**This dataset is from Kaggle. You can find it [HERE](https://www.kaggle.com/onkardhavan/us-president-height-dataset).**

### THANK YOU!