# Penguin Data
---
**Author:** Robert Kelley  
**Version:** 2.0  
**Semester:** Spring 2021  
**Summary:**  

I developed this notebook so we could walk through the various Pandas functions for getting descriptive statistics on a dataset. The dataset for this notebook was obtained from: https://github.com/allisonhorst/palmerpenguins.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import scipy.stats as stats

## Read the dataset / Quick look at data

In [None]:
df = pd.read_csv('penguins_size.csv')

In [None]:
df.head()

In [None]:
male = df[df['sex']=='MALE']

In [None]:
female = df[df['sex']=='FEMALE']

In [None]:
len(female)

In [None]:
len(male)+len(female)

In [None]:
df.describe()

## Measures of Centrality
How the data is clustered.

In [None]:
df['culmen_length_mm'].median()

In [None]:
df.head()

In [None]:
df.mode()

### Arithmetic Mean

In [None]:
# numeric_only=True skips string columns such as species and sex
df.mean(numeric_only=True)

In [None]:
female['culmen_length_mm'].mean()

### Geometric Mean

In [None]:
values = [3.5, 75, 653, 12]
mymean = sum(values) / len(values)
print(mymean)
print(stats.gmean(values))
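
For intuition, the geometric mean is the nth root of the product of n values, which is the same as exponentiating the average of their logarithms. The sketch below reproduces it by hand for the same values and compares against the standard library's `statistics.geometric_mean` (this assumes Python 3.8 or later).

```python
import math
import statistics

values = [3.5, 75, 653, 12]

# Geometric mean: exponentiate the average of the natural logs.
log_mean = sum(math.log(v) for v in values) / len(values)
gmean_by_hand = math.exp(log_mean)

# The standard library offers the same calculation (Python 3.8+).
gmean_stdlib = statistics.geometric_mean(values)

print(gmean_by_hand, gmean_stdlib)
```

Because one large value (653) dominates the arithmetic mean, the geometric mean comes out much smaller here.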

### Harmonic Mean

In [None]:
rates = [15000, 16000, 28000, 21000, 30000]
mymean = sum(rates) / len(rates)
print(mymean)
print(stats.hmean(rates))
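
The harmonic mean is the number of values divided by the sum of their reciprocals, which makes it the natural average for rates. A minimal sketch that computes it by hand for the same list and checks it against `statistics.harmonic_mean` from the standard library:

```python
import statistics

rates = [15000, 16000, 28000, 21000, 30000]

# Harmonic mean: n divided by the sum of the reciprocals.
hmean_by_hand = len(rates) / sum(1 / r for r in rates)

# Standard-library equivalent, used here only as a cross-check.
hmean_stdlib = statistics.harmonic_mean(rates)

print(hmean_by_hand, hmean_stdlib)
```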

## Measures of Dispersion
How spread out our data are.

### Range

In [None]:
df['culmen_length_mm'].max()-df['culmen_length_mm'].min()

### Interquartile Range

In [None]:
df.culmen_length_mm.quantile(.75)-df.culmen_length_mm.quantile(.25)

### Variance

In [None]:
df.culmen_length_mm.var()

### Standard Deviation

In [None]:
df.culmen_length_mm.std()
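
One detail worth knowing: pandas `.var()` and `.std()` default to the sample formula (`ddof=1`), while NumPy's `np.std` defaults to the population formula (`ddof=0`). The sketch below uses a small made-up series, not the penguin data, to show the difference.

```python
import numpy as np
import pandas as pd

# Made-up values chosen so the population standard deviation is exactly 2.
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = s.std()              # pandas default: ddof=1 (divide by n - 1)
population_std = s.std(ddof=0)    # divide by n instead
numpy_std = np.std(s.to_numpy())  # NumPy default: ddof=0

print(sample_std, population_std, numpy_std)
```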

## Correlation

In [None]:
# Restrict to numeric columns; string columns cannot be correlated
df.select_dtypes(include='number').corr()

## Using Pandas

In [None]:
p = pd.read_csv('penguins_lter.csv')

We can look at the data with .sample(), .head() or .tail()

In [None]:
p.sample(5)

View the column names

In [None]:
p.columns

Group by various columns in the dataset to determine how many different values are in the categorical variables.

In [None]:
p['Species'].groupby(p['Species']).count()
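
The same category counts can also be obtained with `value_counts()`, and `nunique()` reports how many distinct values each column holds. The frame below is a small made-up stand-in for the penguin data, so the values are illustrative only.

```python
import pandas as pd

# Small made-up frame standing in for the penguin data.
demo = pd.DataFrame({
    'Species': ['Adelie', 'Adelie', 'Gentoo', 'Chinstrap', 'Gentoo', 'Adelie'],
    'Island': ['Torgersen', 'Biscoe', 'Biscoe', 'Dream', 'Biscoe', 'Dream'],
})

# Rows per category, sorted from most to least frequent.
species_counts = demo['Species'].value_counts()
print(species_counts)

# Number of distinct values in every column at once.
distinct_per_column = demo.nunique()
print(distinct_per_column)
```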

Drop columns we don't need.

In [None]:
p.drop(['Sample Number', 'Region','Stage', 'Individual ID', 'Comments'], axis=1, inplace=True)

In [None]:
p.columns

Rename the columns to names that are easier to work with.

In [None]:
col_names = ['study_name',
             'species',
             'island',
             'clutch_completion',
             'date_egg',
             'bill_length',
             'bill_depth',
             'flipper_length',
             'body_mass',
             'sex',
             'delta_15',
             'delta_13'
            ]

In [None]:
p.columns = col_names

In [None]:
p.groupby(p.species).count()

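Here was my original code, which split the species string into four columns and then dropped the last three. It turns out we only need the first word. Older pandas versions silently kept just the first column when the split result was assigned to a single column, but recent versions may raise an error, so we select the first word explicitly.

In [None]:
#p[['species','t1','t2','t3']] = p.species.str.split(expand=True)
#p.drop(['t1','t2','t3'],axis=1, inplace=True)

In [None]:
# Keep only the first word of the species name (e.g. 'Adelie')
p['species']=p.species.str.split().str[0]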

In [None]:
p.species

In [None]:
p.head()

I didn't mean to drop the measurement units from the column names. Here we rename just the specific columns that need them.

In [None]:
p.rename(columns={
    'bill_length':'bill_length_mm',
    'bill_depth': 'bill_depth_mm',
    'flipper_length': 'flipper_length_mm',
    'body_mass': 'body_mass_g'
}, inplace=True)

Let's look at the data types.

In [None]:
p.dtypes

We have several string columns and several float64 columns. date_egg should be a date.

In [None]:
p.date_egg = pd.to_datetime(p.date_egg)

In [None]:
p.dtypes

I want to add a new column for the day of the study on which the penguin was observed. We first subtract the earliest date in the column from date_egg, convert the result to a string, strip the word 'days' from the end, and convert to an int.

In [None]:
p['study_day']=p.date_egg-p.date_egg.min()
p['study_day']=p['study_day'].astype(str)
p['study_day']=p['study_day'].str[:-4]
p['study_day']=p['study_day'].astype(int)

In [None]:
p.study_day
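
As an alternative to the string round trip, the `.dt.days` accessor on a timedelta column returns the whole number of days directly. This is a sketch on made-up dates standing in for `date_egg`, not the notebook's actual data.

```python
import pandas as pd

# Made-up dates standing in for the date_egg column.
demo = pd.DataFrame({
    'date_egg': pd.to_datetime(['2007-11-11', '2007-11-16', '2008-11-11'])
})

# Subtracting the earliest date gives a timedelta column;
# .dt.days extracts the whole days as integers.
demo['study_day'] = (demo['date_egg'] - demo['date_egg'].min()).dt.days

print(demo)
```

This avoids depending on how pandas happens to format a timedelta as text.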

We can do most of the manipulations with chaining (except the last one).

In [None]:
#p['study_day']=(p.date_egg-p.date_egg.min()).astype(str).str[:-4]
#p['study_day']=p['study_day'].astype(int)

In [None]:
len(p)

Let's get rid of all the rows that have missing data in certain columns.

In [None]:
p = p[p.bill_length_mm.notna()]

In [None]:
p = p[p.delta_15.notna()]
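
The two filters above can also be written as a single `dropna` call with the `subset` argument, which only checks the listed columns. A sketch on a made-up frame (the values are illustrative, not taken from the penguin data):

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps in different columns.
demo = pd.DataFrame({
    'bill_length_mm': [39.1, np.nan, 40.3, 36.7],
    'delta_15': [8.9, 8.4, np.nan, 8.8],
    'sex': ['MALE', 'FEMALE', np.nan, 'FEMALE'],
})

# Drop rows with a missing value in either listed column;
# missing values in other columns (such as sex) are not checked.
cleaned = demo.dropna(subset=['bill_length_mm', 'delta_15'])

print(cleaned)
```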

In [None]:
p['sex'].groupby(p.sex).count()

Looks like we have an errant value in 'sex'. We need to remove it. **This is the code I couldn't remember: get the index of the DataFrame row, then drop it.**

In [None]:
i = p[p.sex=='.'].index
p.drop(i, inplace=True)

.info() gives us some summary metadata. I noticed that we are missing the sex for 5 penguins.

In [None]:
p.info()

In [None]:
# count() skips NaN, so count the missing values directly
p.sex.isna().sum()

Let's get the index for the rows where sex is missing and investigate.

In [None]:
i = p[p.sex.isna()].index
print(i)
p.loc[i]

NaN (not a number) means the value is missing. Let's just drop these too.

In [None]:
p.drop(i, inplace=True)
p.info()

Looks good. Now we can save.

In [None]:
p.to_csv('processed_penguins.csv', index=False)

In [None]:
p.head()

Let's split out a separate data set for Adelie penguins to save. First, we subset the data by 'Adelie', drop the 'species' column, and then save to a file.

In [None]:
# Drop on the filtered result rather than in place to avoid a SettingWithCopyWarning
adelie = p[p['species']=='Adelie'].drop(columns=['species'])
adelie.to_csv('adelie.csv', index=False)

In [None]:
adelie.head()

I would like to be able to save data sets for each species. This code groups the data by species, then loops through the groups (each group's key is the species name), drops the 'species' column, and saves each group to a file named after the species. The subsets are also kept in a dictionary keyed by the lowercase species name.

In [None]:
by_species = {}
for t, group in p.groupby('species'):
    # Writing through locals() is unreliable, so store each subset in a dict instead
    by_species[t.lower()] = group.drop(columns=['species'])
    by_species[t.lower()].to_csv(t.lower()+'.csv', index=False)

## T-Test
Let's run a T-Test on the data to see if there is a significant difference between the mean flipper length of male and female penguins. First, we have two different ways to subset the data.

In [None]:
male = p[p['sex']=='MALE']
female = p.query("sex == 'FEMALE'")

Then we can use ttest_ind from scipy.stats (imported above as stats) to do the T-Test.

In [None]:
stats.ttest_ind(male.flipper_length_mm, female.flipper_length_mm)

The p-value is below the .05 significance level, so we reject the null hypothesis of equal means and conclude that mean flipper length differs between male and female penguins.
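
By default `ttest_ind` assumes the two groups have equal variances. Passing `equal_var=False` runs Welch's version, which drops that assumption. The sketch below compares the two on randomly generated numbers, not the penguin measurements; the group means, spreads, and sizes are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Made-up flipper lengths (mm); not the penguin measurements.
group_a = rng.normal(loc=205, scale=14, size=160)
group_b = rng.normal(loc=197, scale=12, size=160)

# Student's t-test assumes equal variances; Welch's version does not.
student = stats.ttest_ind(group_a, group_b)
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

alpha = 0.05
for name, result in [('Student', student), ('Welch', welch)]:
    # Report the test statistic, the p-value, and whether it is below alpha.
    print(name, round(result.statistic, 3), result.pvalue, result.pvalue < alpha)
```

With equal group sizes the two tests give the same t statistic and differ only in the degrees of freedom used for the p-value.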