---
# Computing Aggregate Statistics

In [None]:
import pandas as pd

names = ['Bob','Jessica','Mary','John','Mel']
grades = [76,95,77,78,99]

GradeList = list(zip(names,grades))
df = pd.DataFrame(data = GradeList, columns=['Names', 'Grades'])
df

In [None]:
df['Grades'].count() # computes the number of values

df['Grades'].mean() # computes the arithmetic average of the values

df['Grades'].std() # computes the standard deviation of the values

df['Grades'].min() # computes the minimum of the values
df['Grades'].max() # computes the maximum of the values

df['Grades'].quantile(.25) # computes the first quartile of the values
df['Grades'].quantile(.5) # computes the second quartile (the median) of the values
df['Grades'].quantile(.75) # computes the third quartile of the values

### Note
If you run all of the previous statements in a single cell, the notebook displays only the result of the last one, the final `.quantile()` call. Run them one at a time to see each result. They are grouped here for reference only.
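One way to see every statistic from a single cell is to print each result explicitly, or to call `describe()`, which returns most of them at once. A minimal sketch that rebuilds the same five-grade dataframe so it runs on its own:

```python
import pandas as pd

names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
grades = [76, 95, 77, 78, 99]
df = pd.DataFrame({'Names': names, 'Grades': grades})

# A notebook cell only displays its last expression, so print the rest explicitly.
print('count:', df['Grades'].count())
print('mean: ', df['Grades'].mean())
print('min:  ', df['Grades'].min())
print('max:  ', df['Grades'].max())

# describe() bundles count, mean, std, min, the quartiles, and max into one result.
summary = df['Grades'].describe()
print(summary)
```

The quartiles appear in the `describe()` result under the labels `25%`, `50%`, and `75%`.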

### Other Measures of Central Tendency

In [None]:
# computes the arithmetic average of the values in a column
# mean = dividing the sum of all values by the number of values
df['Grades'].mean()

# finds the median of the values in a column
# median = the middle value if they are sorted in order
df['Grades'].median()

# finds the mode of the values in a column
# mode = the most common single value
df['Grades'].mode()
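
Unlike `mean()` and `median()`, `mode()` returns a Series rather than a single number, because a column can have several equally common values. In the grade data every value occurs exactly once, so all five come back. A short sketch with the same grades, plus a made-up list with one repeated value for contrast:

```python
import pandas as pd

grades = pd.Series([76, 95, 77, 78, 99])

print(grades.mean())    # sum of the values divided by how many there are
print(grades.median())  # middle value once the grades are sorted
print(grades.mode())    # every value ties at one occurrence, so all are returned

# Hypothetical data with a repeated value, to show a single mode.
repeated = pd.Series([76, 95, 76, 78, 99])
single_mode = repeated.mode()
print(single_mode)
```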

In [None]:
df['Grades'].var() # computes the sample variance of the values

In [None]:
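# computes the variance of every numeric column
# numeric_only=True skips the text column 'Names', which would otherwise raise an error in recent pandas
df.var(numeric_only=True)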

### Your Turn
Our dataset has only one numeric column. Try creating a dataframe from the following data and computing summary statistics for each numeric column.

In [None]:
names = ['Bob','Jessica','Mary','John','Mel']
grades = [76,95,77,78,99]
bsdegrees = [1,1,0,0,1]
msdegrees = [2,1,0,0,0]
phddegrees = [0,1,0,0,0]

---
# Computing Aggregate Statistics on Matching Rows

In [None]:
import pandas as pd

names = ['Bob','Jessica','Mary','John','Mel']
grades = [76,95,77,78,99]
bsdegrees = [1,1,0,0,1]
msdegrees = [2,1,0,0,0]
phddegrees = [0,1,0,0,0]

GradeList = list(zip(names,grades,bsdegrees,msdegrees,phddegrees))

df = pd.DataFrame(data=GradeList, columns=['Name','Grade','BS','MS','PhD'])
df

In [None]:
# counts the non-missing values in each column for rows where PhD is 0
df.loc[df['PhD']==0].count()

In [None]:
# averages the grades of the rows where PhD is 0
df.loc[df['PhD']==0, 'Grade'].mean()
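
The row selection works because the comparison `df['PhD']==0` produces one True/False value per row, and `.loc` keeps only the True rows. A minimal sketch that separates the two steps, using the `Name`, `Grade`, and `PhD` columns from the dataframe above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bob', 'Jessica', 'Mary', 'John', 'Mel'],
    'Grade': [76, 95, 77, 78, 99],
    'PhD': [0, 1, 0, 0, 0],
})

# The comparison yields one True/False per row.
no_phd = df['PhD'] == 0
print(no_phd.tolist())

# .loc[rows, column] keeps the True rows and picks the Grade column in one step.
grades_no_phd = df.loc[no_phd, 'Grade']
print(grades_no_phd.mean())
```

Storing the mask in a variable makes it easy to reuse for several statistics on the same subset.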

### Your Turn
Using the following data, what is the average grade for people with MS degrees?

In [None]:
import pandas as pd
  
names = ['Bob','Jessica','Mary','John','Mel','Sam','Cathy','Henry','Lloyd']
grades = [76,95,77,78,99,84,79,100,73]
bsdegrees = [1,1,0,0,1,1,1,0,1]
msdegrees = [2,1,0,0,0,1,1,0,0]
phddegrees = [0,1,0,0,0,2,1,0,0]

---
# Sorting Data

In [None]:
import pandas as pd

Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)

df.head()

In [None]:
# sort by age, oldest first
df = df.sort_values(by='age', ascending=False)
df.head()

In [None]:
df = df.sort_values(by=['grade', 'age'], ascending=[True, True])
df.head()
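
When `by` and `ascending` are both lists, each column gets its own sort direction, and later columns only break ties in the earlier ones. A small sketch on made-up rows (the values are hypothetical, not taken from `gradedata.csv`):

```python
import pandas as pd

# Made-up rows for illustration; not taken from gradedata.csv.
df = pd.DataFrame({
    'fname': ['Ann', 'Ben', 'Cal', 'Dee'],
    'age': [17, 18, 17, 18],
    'grade': [82.4, 78.2, 90.1, 85.0],
})

# Oldest first; within the same age, lowest grade first.
ordered = df.sort_values(by=['age', 'grade'], ascending=[False, True])
print(ordered)

# sort_values keeps the original row labels; reset_index renumbers them from 0.
ordered = ordered.reset_index(drop=True)
print(ordered)
```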

### Your Turn
Can you sort the dataframe to order it by name, age and then grade?

---
# Correlation

In [None]:
import pandas as pd

Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)

df.head()

In [None]:
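# pairwise correlations between the numeric columns
# numeric_only=True skips text columns such as the names and address
df.corr(numeric_only=True)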
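
The result of `corr()` is a square table with one row and one column per numeric variable, so a single pair can be read out by label. A sketch on made-up values (hypothetical, not taken from `gradedata.csv`), assuming pandas 1.5 or later for the `numeric_only` argument:

```python
import pandas as pd

# Made-up values for illustration; not taken from gradedata.csv.
df = pd.DataFrame({
    'fname': ['Ann', 'Ben', 'Cal', 'Dee', 'Eve'],
    'hours': [2, 4, 6, 8, 10],
    'grade': [60, 65, 75, 80, 90],
})

# numeric_only=True skips the text column; the result is a symmetric matrix
# of Pearson correlation coefficients with 1.0 on the diagonal.
corr = df.corr(numeric_only=True)
print(corr)

# Pull out a single pair, or compute it directly from two columns.
print(corr.loc['hours', 'grade'])
print(df['hours'].corr(df['grade']))
```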

### Your Turn
Load the data in the following code and find the correlations:

In [1]:
import pandas as pd
 
Location = "datasets/tamiami.csv"

---
# Regression

In [2]:
import pandas as pd

Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)

df.head()

| Unnamed: 0 | fname | lname | gender | age | exercise | hours | grade | address |
|---|---|---|---|---|---|---|---|---|
| 0 | Marcia | Pugh | female | 17 | 3 | 10 | 82.4 | 9253 Richardson Road, Matawan, NJ 07747 |
| 1 | Kadeem | Morrison | male | 18 | 4 | 4 | 78.2 | 33 Spring Dr., Taunton, MA 02780 |
| 2 | Nash | Powell | male | 18 | 5 | 9 | 79.3 | 41 Hill Avenue, Mentor, OH 44060 |
| 3 | Noelani | Wagner | female | 14 | 2 | 7 | 83.2 | 8839 Marshall St., Miami, FL 33125 |
| 4 | Noelani | Cherry | female | 18 | 4 | 15 | 87.4 | 8304 Charles Rd., Lewis Center, OH 43035 |


In [4]:
import statsmodels.formula.api as sm
# fit an ordinary least squares model of grade on age, exercise, and hours
result = sm.ols(formula='grade ~ age + exercise + hours', data=df).fit()
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | grade | R-squared: | 0.664 |
| Model: | OLS | Adj. R-squared: | 0.664 |
| Method: | Least Squares | F-statistic: | 1315.0 |
| Date: | Sat, 17 Feb 2018 | Prob (F-statistic): | 0.0 |
| Time: | 12:09:39 | Log-Likelihood: | -6300.7 |
| No. Observations: | 2000 | AIC: | 12610.0 |
| Df Residuals: | 1996 | BIC: | 12630.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 57.8704 | 1.321 | 43.804 | 0.000 | 55.279 | 60.461 |
| age | 0.0397 | 0.075 | 0.532 | 0.595 | -0.107 | 0.186 |
| exercise | 0.9893 | 0.089 | 11.131 | 0.000 | 0.815 | 1.164 |
| hours | 1.9165 | 0.031 | 61.564 | 0.000 | 1.855 | 1.978 |

| | | | |
|---|---|---|---|
| Omnibus: | 321.187 | Durbin-Watson: | 2.047 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 2196.187 |
| Skew: | -0.567 | Prob(JB): | 0.0 |
| Kurtosis: | 8.007 | Cond. No. | 213.0 |


In [5]:
import statsmodels.formula.api as sm
# age had a high p-value in the previous fit (0.595), so refit without it
result = sm.ols(formula='grade ~ exercise + hours', data=df).fit()
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | grade | R-squared: | 0.664 |
| Model: | OLS | Adj. R-squared: | 0.664 |
| Method: | Least Squares | F-statistic: | 1973.0 |
| Date: | Sat, 17 Feb 2018 | Prob (F-statistic): | 0.0 |
| Time: | 12:10:01 | Log-Likelihood: | -6300.8 |
| No. Observations: | 2000 | AIC: | 12610.0 |
| Df Residuals: | 1997 | BIC: | 12620.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 58.5316 | 0.447 | 130.828 | 0.000 | 57.654 | 59.409 |
| exercise | 0.9892 | 0.089 | 11.131 | 0.000 | 0.815 | 1.163 |
| hours | 1.9162 | 0.031 | 61.575 | 0.000 | 1.855 | 1.977 |

| | | | |
|---|---|---|---|
| Omnibus: | 318.721 | Durbin-Watson: | 2.048 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 2158.0 |
| Skew: | -0.564 | Prob(JB): | 0.0 |
| Kurtosis: | 7.962 | Cond. No. | 43.2 |


In [6]:
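import pandas as pd
import statsmodels.formula.api as sm

Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)

# "- 1" in the formula drops the intercept. Without a constant, statsmodels
# reports the uncentered R-squared, which is not comparable to the fits above.
result = sm.ols(formula='grade ~ age + exercise + hours - 1', data=df).fit()
result.summary()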

| | | | |
|---|---|---|---|
| Dep. Variable: | grade | R-squared: | 0.991 |
| Model: | OLS | Adj. R-squared: | 0.991 |
| Method: | Least Squares | F-statistic: | 72840.0 |
| Date: | Sat, 17 Feb 2018 | Prob (F-statistic): | 0.0 |
| Time: | 12:10:14 | Log-Likelihood: | -6974.3 |
| No. Observations: | 2000 | AIC: | 13950.0 |
| Df Residuals: | 1997 | BIC: | 13970.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| age | 3.1129 | 0.035 | 88.030 | 0.000 | 3.044 | 3.182 |
| exercise | 1.7659 | 0.122 | 14.482 | 0.000 | 1.527 | 2.005 |
| hours | 2.2860 | 0.042 | 54.486 | 0.000 | 2.204 | 2.368 |

| | | | |
|---|---|---|---|
| Omnibus: | 131.221 | Durbin-Watson: | 2.006 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 403.367 |
| Skew: | -0.301 | Prob(JB): | 2.57e-88 |
| Kurtosis: | 5.116 | Cond. No. | 14.2 |
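
The jump in R-squared from 0.664 to 0.991 does not mean the no-intercept model fits better: its log-likelihood is lower (-6974.3 versus -6300.7). When a model has no constant, statsmodels reports the uncentered R-squared, which measures variation around zero rather than around the mean of `grade`, so the two figures are not comparable. A sketch of the effect on synthetic data, using NumPy only (the data, coefficients, and helper name `fit` are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(15, 20, size=n)             # predictor far from zero, like age
y = 60 + 2 * x + rng.normal(0, 5, size=n)   # synthetic response with a large intercept

def fit(X, y):
    """Return least-squares coefficients and the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(resid @ resid)

# With an intercept: compare residuals to variation around the mean of y.
X_const = np.column_stack([np.ones(n), x])
_, ssr_const = fit(X_const, y)
r2_centered = 1 - ssr_const / float(((y - y.mean()) ** 2).sum())

# Without an intercept: compare residuals to variation around zero.
X_noconst = x.reshape(-1, 1)
_, ssr_noconst = fit(X_noconst, y)
r2_uncentered = 1 - ssr_noconst / float((y ** 2).sum())

print(round(r2_centered, 3), round(r2_uncentered, 3))
print(ssr_const < ssr_noconst)  # the intercept model leaves less residual error
```

The uncentered figure comes out much higher even though the model without an intercept has the larger residual error, which is why log-likelihood, AIC, or BIC are safer for comparing the two fits.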


### Your Turn
Create a new column that converts gender to numeric values, such as 1 for female and 0 for male. Can you now add gender to your regression? Does this improve your R-squared?

In [7]:
import pandas as pd
import statsmodels.formula.api as sm
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)

In [8]:
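def score_to_numeric(x):
    # encode gender as a number: 1 for female, 0 for male
    if x=='female':
        return 1
    if x=='male':
        return 0
    # any other value falls through and returns None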

In [None]:
df['gender']=df['gender'].apply(score_to_numeric)
df.head()
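
An alternative to a helper function is `Series.map` with a dictionary. The sketch below writes the result to a new column (the name `gender_num` is made up here), so the original text column survives and the cell can be re-run safely. The rows are hypothetical, not taken from `gradedata.csv`:

```python
import pandas as pd

# Made-up rows for illustration; not taken from gradedata.csv.
df = pd.DataFrame({'gender': ['female', 'male', 'male', 'female']})

# map() looks each value up in the dict; values missing from the dict become NaN.
df['gender_num'] = df['gender'].map({'female': 1, 'male': 0})
print(df)
```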

In [13]:
result = sm.ols(formula='grade ~ gender + age + exercise + hours - 1', data=df).fit()
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | grade | R-squared: | 0.991 |
| Model: | OLS | Adj. R-squared: | 0.991 |
| Method: | Least Squares | F-statistic: | 55130.0 |
| Date: | Sat, 17 Feb 2018 | Prob (F-statistic): | 0.0 |
| Time: | 12:20:28 | Log-Likelihood: | -6964.8 |
| No. Observations: | 2000 | AIC: | 13940.0 |
| Df Residuals: | 1996 | BIC: | 13960.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| gender | 1.5291 | 0.351 | 4.357 | 0.000 | 0.841 | 2.217 |
| age | 3.0739 | 0.036 | 84.616 | 0.000 | 3.003 | 3.145 |
| exercise | 1.7376 | 0.122 | 14.293 | 0.000 | 1.499 | 1.976 |
| hours | 2.2837 | 0.042 | 54.672 | 0.000 | 2.202 | 2.366 |

| | | | |
|---|---|---|---|
| Omnibus: | 140.087 | Durbin-Watson: | 2.016 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 464.795 |
| Skew: | -0.302 | Prob(JB): | 1.18e-101 |
| Kurtosis: | 5.283 | Cond. No. | 40.4 |
