# In-class exercises - Intro to Pandas Series and DataFrames

## Import libs

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# get and store current file path for file i/o later on in tutorial
import os
cwd = os.getcwd()

## First import 'response_time_data.csv' data file
* Contains response times (RTs) from 800 trials of a simple detection task from each of 20 subjects
* The data were organized into a DataFrame and then saved in CSV format
* The index (row) and column labels are stored in the CSV file, so you'll need to read them in explicitly
* Make sure to have a look at the DataFrame - use the df.head() function

In [2]:
# build the path with os.path.join so the separator is correct on any OS
file_name = os.path.join(cwd, 'response_time_data.csv')

df = pd.read_csv(file_name, index_col=0, header=0)
df.head()

## Now have a look at the data using built-in Pandas functionality
* Check out the max/min of each column (subject), standard deviation, percentiles, etc.

In [3]:
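d = df.describe()
# select by label rather than position: row 'mean', column 'Sub0'
d.loc['mean', 'Sub0']
# etc... other rows include 'count', 'std', 'min', '25%', '50%', '75%', 'max'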

3492.61432305078

## Are there missing values (NaNs) in the data?

In [4]:
# count the NaNs in each column (subject)
df.isna().sum()

Sub0      0
Sub1      0
Sub2      0
Sub3      0
Sub4      4
Sub5      0
Sub6      0
Sub7      1
Sub8      0
Sub9      2
Sub10     0
Sub11    11
Sub12     0
Sub13     3
Sub14     3
Sub15     0
Sub16     0
Sub17    15
Sub18     7
Sub19     0
dtype: int64

## What about outliers? Let's define outliers here as values more than 2 standard deviations away from the mean for each subject
* After you've found the outliers for each subject, replace those values with np.nan (NaN)

In [None]:
d = df.describe()

# then use the std and mean fields to define the outlier limit for each subject
# then replace with NaNs
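
One possible way to do this is sketched below. It is not the only solution, and it runs on synthetic data: `demo`, its three columns, and the planted outlier are all made up for illustration and stand in for the real `df`. The idea is to compute each subject's mean and std, build a boolean mask of trials that fall outside the limit, and let `DataFrame.mask` swap those entries for NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 800 trials x 3 subjects of synthetic RTs
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(3500, 500, size=(800, 3)),
                    columns=['Sub0', 'Sub1', 'Sub2'])
demo.iloc[5, 0] = 10000.0  # plant one obvious outlier for Sub0

# per-subject (column-wise) mean and standard deviation
mu = demo.mean()
sd = demo.std()

# True wherever a trial is more than 2 std away from its own subject's mean
is_outlier = (demo - mu).abs() > 2 * sd

# mask() replaces the True entries with NaN and leaves everything else untouched
demo_clean = demo.mask(is_outlier)

print(is_outlier.sum())  # number of outliers flagged per subject
```

Because `mu` and `sd` are Series indexed by the column labels, the subtraction and comparison line up subject by subject, so no explicit loop over columns is needed.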

## After you've found the outliers and replaced with NaNs for each subject, check out this function:
[pandas.DataFrame.interpolate](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.interpolate.html#pandas.DataFrame.interpolate)

* Use this function to interpolate the missing values for each subject (do not interpolate across subjects!)
* Just use linear interpolation...

In [None]:
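# axis=0 interpolates down each column, i.e. within a subject and never across subjects
df = df.interpolate(method='linear', axis=0)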
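
To see what linear interpolation does within a subject, here is a small sketch on a toy DataFrame. The `demo` data are hypothetical and chosen so the filled values are easy to check by hand. With `axis=0`, each NaN is filled from the values above and below it in the same column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each column is one subject, each row one trial
demo = pd.DataFrame({'Sub0': [100.0, np.nan, 300.0, 400.0],
                     'Sub1': [10.0, 20.0, np.nan, 40.0]})

# linear interpolation down each column (within subject)
filled = demo.interpolate(method='linear', axis=0)

print(filled)
# Sub0's gap becomes 200.0 (halfway between 100 and 300)
# Sub1's gap becomes 30.0 (halfway between 20 and 40)
```

Note that with the default settings a NaN in the very first row has no earlier value to interpolate from and stays NaN, so it is worth checking for remaining NaNs afterwards.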

## You can explore the "Missing Values" page for Pandas to figure out other ways of filling in missing values and outliers

[page is here](https://pandas.pydata.org/pandas-docs/stable/missing_data.html#missing-data)

* Figure out how to replace the outliers with the mean of each subject
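
A minimal sketch of one approach, assuming the outliers have already been replaced with NaN as in the earlier step. The `demo` data are hypothetical. `fillna` accepts a Series and matches it against the column labels, so each subject's gaps are filled with that subject's own mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: NaN marks a trial that was missing or flagged as an outlier
demo = pd.DataFrame({'Sub0': [100.0, np.nan, 300.0],
                     'Sub1': [10.0, 20.0, np.nan]})

# mean() skips NaNs by default, so these are the means of the remaining trials
subject_means = demo.mean()

# fillna with a Series matches on column labels: each subject gets its own mean
filled = demo.fillna(subject_means)

print(filled)
```

One design choice to be aware of: because the mean is computed after the outliers were set to NaN, the fill value is the mean of the non-outlier trials. Computing the mean before masking would give a value that is pulled toward the outliers.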

## Use the pandas.DataFrame.sample function to generate bootstrapped confidence intervals for the data from subject 11

[see this page for Samples](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.sample.html)


* Resample Sub11's data with replacement, each time pulling N samples (800 in this case)
* Generate a distribution of means across all resamples
* Compute 95% confidence intervals using:

[this page for quantiles](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.quantile.html)
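
The steps above could be sketched as follows. This is an illustration under assumptions rather than a reference solution: `rts` is synthetic data standing in for `df['Sub11']`, and the number of resamples (`n_boot = 1000`) is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df['Sub11']: 800 synthetic RTs
rng = np.random.default_rng(0)
rts = pd.Series(rng.normal(3500, 500, size=800), name='Sub11')

n_boot = 1000  # number of resamples (arbitrary choice for this sketch)

# each resample draws len(rts) trials with replacement; keep only its mean
boot_means = pd.Series([
    rts.sample(n=len(rts), replace=True, random_state=i).mean()
    for i in range(n_boot)
])

# the 2.5th and 97.5th percentiles of the bootstrap means bound the 95% CI
ci_low, ci_high = boot_means.quantile([0.025, 0.975])

print(f'mean = {rts.mean():.1f}, 95% CI = [{ci_low:.1f}, {ci_high:.1f}]')
```

With the real data, Sub11 contains NaNs, so you may want to call `dropna()` or fill the gaps first. `mean()` will skip NaNs either way, but the effective sample size of each resample would then vary.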