# The DataFrame
Most of a data analyst's work involves tabular data, organized in rows and columns. The DataFrame is the primary pandas data structure and can be thought of as a collection of Series that share an index. Both the rows and the columns of a DataFrame are indexed; the column index labels are simply called column names. Operations on DataFrames can be applied to all elements, by row, or by column. The technical term axis refers to the vertical (row) and horizontal (column) directions of the frame.

The row axis is numbered 0 and the column axis is numbered 1, a convention borrowed from numpy, where ndarrays can have any number of axes, numbered from 0. The `axis` argument shows up in most DataFrame methods and lets you choose whether the operation runs down the rows (`axis=0`, giving one result per column) or across the columns (`axis=1`, giving one result per row).
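
As a rough illustration of the `axis` argument, the sketch below uses a small made-up table of test scores (the names `scores`, `col_means` and `row_means` are hypothetical and not part of the baseball data).

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
scores = pd.DataFrame({'math': [90, 80, 70], 'art': [60, 75, 95]},
                      index=['ann', 'bob', 'cat'])

# axis=0 collapses the rows: one result per column
col_means = scores.mean(axis=0)

# axis=1 collapses the columns: one result per row
row_means = scores.mean(axis=1)

print(col_means)
print(row_means)
```

The result of `axis=0` is indexed by the column names, while the result of `axis=1` is indexed by the row labels.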

Just as importantly, indices are aligned silently behind the scenes, so take care when operating on two different DataFrames at the same time.
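
A minimal sketch of what alignment means in practice, using two made-up frames (`left` and `right` are hypothetical names) whose row labels only partly overlap:

```python
import pandas as pd

# Two toy frames with overlapping but not identical row labels
left = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'x': [10, 20, 30]}, index=['b', 'c', 'd'])

# Addition matches rows by label, not by position
total = left + right
print(total)
```

Labels present in only one of the frames ('a' and 'd' here) come out as missing values, which is easy to overlook.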

### The multiple ways to construct a DataFrame
There are a number of ways to construct a DataFrame by hand. We will only cover a couple here because you rarely need to do so: most of the time you will be reading external flat files, getting data from the web, or reading from relational databases. Nonetheless, it does occur, so here we go!

In [292]:
# Let's import our packages.
import pandas as pd
import numpy as np

In [8]:
# create dataframe from a dictionary of lists. The keys are the column names NOT the indices
df = pd.DataFrame({'name':['Ted', 'Ned', 'Jed'], 'Phone':['Samsung', 'Samsung', 'IOS'], 'Favorite Number':[99, 7, 4]})
df

Unnamed: 0,Favorite Number,Phone,name
0,99,Samsung,Ted
1,7,Samsung,Ned
2,4,IOS,Jed


## Why did the column order get changed?
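When this notebook was written, dictionaries were unordered and pandas sorted the keys alphabetically, so the columns did not come out in the order they were typed. On Python 3.6+ with pandas 0.23 or later, the insertion order of the dictionary is preserved, so you may not see this reordering at all.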
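
Whether the columns get reordered depends on your versions. On a recent Python and pandas installation the dictionary's insertion order is typically kept, which the sketch below checks on a made-up frame (`people` and `reordered` are hypothetical names). Selecting with a list of column names is one way to force a particular order afterwards.

```python
import pandas as pd

people = pd.DataFrame({'name': ['Ted', 'Ned'],
                       'Phone': ['Samsung', 'IOS'],
                       'Favorite Number': [99, 7]})

# On recent versions this prints the columns in the order they were written
print(list(people.columns))

# Selecting with a list of names returns the columns in the order of that list
reordered = people[['Favorite Number', 'name', 'Phone']]
print(list(reordered.columns))
```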

In [10]:
# Let's fix the column order 
df = pd.DataFrame({'name':['Ted', 'Ned', 'Jed'], 'Phone':['Samsung', 'Samsung', 'IOS'], 'Favorite Number':[99, 7, 4]},
                 columns=['name', 'Phone', 'Favorite Number'])
df

Unnamed: 0,name,Phone,Favorite Number
0,Ted,Samsung,99
1,Ned,Samsung,7
2,Jed,IOS,4


In [13]:
# Lets get some info on the columns
# All columns with strings are called objects
# There are several other data types not seen here. 
# Favorite number is a 64 bit integer
df.info() # can get slightly less info with df.dtypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
name               3 non-null object
Phone              3 non-null object
Favorite Number    3 non-null int64
dtypes: int64(1), object(2)
memory usage: 152.0+ bytes


In [20]:
# Get some summary statistics.
df.describe(include='all') # without 'all' only numeric columns would be described

Unnamed: 0,name,Phone,Favorite Number
count,3,3,3.0
unique,3,2,
top,Jed,Samsung,
freq,1,2,
mean,,,36.666667
std,,,54.003086
min,,,4.0
25%,,,5.5
50%,,,7.0
75%,,,53.0


In [54]:
# One more common method of creating a dataframe by hand
# That's right, use numpy, but this time give two dimensions
df = pd.DataFrame(np.random.rand(10,5), columns=list('abcde'))
df

Unnamed: 0,a,b,c,d,e
0,0.490387,0.907717,0.31299,0.97576,0.416036
1,0.073339,0.056285,0.585304,0.926261,0.019517
2,0.072365,0.857369,0.778542,0.440389,0.624256
3,0.640304,0.485401,0.191172,0.346789,0.608614
4,0.595921,0.28114,0.10607,0.096534,0.322405
5,0.72123,0.106494,0.199812,0.945397,0.20808
6,0.005931,0.51707,0.840638,0.351816,0.998252
7,0.261995,0.939195,0.901965,0.215306,0.963369
8,0.963691,0.006193,0.864401,0.694629,0.947458
9,0.441066,0.024593,0.833346,0.321675,0.585262


## Your data is horrible. Can we speed it up and get some real data?
Yes, there is a small baseball.csv dataset in the data subdirectory. We can use read_csv, one of the many pandas functions for importing data into a DataFrame.

In [59]:
# There are numerous options to read_csv but this is already a clean dataset so we don't use them now
df = pd.read_csv('data/baseball.csv')
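
To give a feel for a couple of the `read_csv` options without needing the file on disk, here is a sketch that reads a tiny made-up CSV from a string. The player names and numbers are invented, and `usecols` and `nrows` are just two of the available parameters.

```python
import io
import pandas as pd

# A tiny, invented CSV standing in for a real file
csv_text = """PLAYER,G,HR
Player A,85,6
Player B,54,3
Player C,15,0
"""

# usecols keeps only the named columns, nrows stops after that many data rows
sample = pd.read_csv(io.StringIO(csv_text), usecols=['PLAYER', 'HR'], nrows=2)
print(sample)
```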

In [60]:
#lets get a small glimpse of it
df.head(10)

Unnamed: 0,PLAYER,Record_ID#,SALARY,ROOKIE,POS,G,PA,AB,H,1B,2B,3B,HR,TB,BB,UBB,IBB,HBP,SF,SH
0,Gregg Zaun,1,3750000,0,2,85,288,245,58,40,12,0,6,88,38,37,1,1,1,3
1,Henry Blanco,2,3175000,0,2,54,128,120,35,29,3,0,3,47,6,5,1,0,0,2
2,Moises Alou,7,7500000,0,7,15,54,49,17,15,2,0,0,19,2,2,0,2,1,0
3,Corey Patterson,9,3000000,0,8,123,392,366,75,46,17,2,10,126,16,16,0,1,4,5
4,Rod Barajas,10,700000,0,2,100,377,349,87,53,23,0,11,143,17,17,0,7,4,0
5,Rich Aurilia,12,4500000,0,3,134,440,407,115,83,21,1,10,168,30,26,4,1,2,0
6,Yorvit Torrealba,13,3000000,0,2,68,261,236,58,35,17,0,6,93,12,12,0,5,3,5
7,Luis Castillo,14,6250000,0,4,85,359,298,73,62,7,1,3,91,50,48,2,2,2,7
8,Marlon Anderson,16,1050000,0,11,87,151,138,29,22,6,0,1,38,9,9,0,0,2,2
9,Placido Polanco,17,4600000,0,4,138,629,580,178,133,34,3,8,242,35,33,2,6,4,4


In [61]:
# Lets retrieve the columns
df.columns

Index(['PLAYER', 'Record_ID#', 'SALARY', 'ROOKIE', 'POS', 'G', 'PA', 'AB', 'H',
       '1B', '2B', '3B', 'HR', 'TB', 'BB', 'UBB', 'IBB', 'HBP', 'SF', 'SH'],
      dtype='object')

In [167]:
# Let's get the info and description like we did with the previous dataframe
# (this cell was run after the rename below, which is why Games and Total Bases already appear)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 337 entries, 0 to 336
Data columns (total 20 columns):
PLAYER         337 non-null object
Record_ID#     337 non-null int64
SALARY         337 non-null int64
ROOKIE         337 non-null int64
POS            337 non-null int64
Games          337 non-null int64
PA             337 non-null int64
AB             337 non-null int64
H              337 non-null int64
1B             337 non-null int64
2B             337 non-null int64
3B             337 non-null int64
HR             337 non-null int64
Total Bases    337 non-null int64
BB             337 non-null int64
UBB            337 non-null int64
IBB            337 non-null int64
HBP            337 non-null int64
SF             337 non-null int64
SH             337 non-null int64
dtypes: int64(19), object(1)
memory usage: 52.7+ KB


In [None]:
df.describe()

## Change columns names
Many of these column names are quite cryptic and meaningless if you don't know baseball. Let's change G to Games and TB to Total Bases.

In [63]:
# Make a couple column name changes
df.rename(columns={'TB':'Total Bases', 'G':'Games'}, inplace=True)
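
As an alternative to `inplace=True`, `rename` can also return a new DataFrame and leave the original untouched. A small sketch on a made-up frame (`stats` and `renamed` are hypothetical names):

```python
import pandas as pd

stats = pd.DataFrame({'G': [85, 54], 'TB': [88, 47]})

# Without inplace=True, rename returns a new frame
renamed = stats.rename(columns={'G': 'Games', 'TB': 'Total Bases'})

print(list(stats.columns))    # original names are unchanged
print(list(renamed.columns))  # new names
```

Which style to use is mostly a matter of taste; assigning the result to a name keeps the original available if you need to compare.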

In [65]:
df.head()

Unnamed: 0,PLAYER,Record_ID#,SALARY,ROOKIE,POS,Games,PA,AB,H,1B,2B,3B,HR,Total Bases,BB,UBB,IBB,HBP,SF,SH
0,Gregg Zaun,1,3750000,0,2,85,288,245,58,40,12,0,6,88,38,37,1,1,1,3
1,Henry Blanco,2,3175000,0,2,54,128,120,35,29,3,0,3,47,6,5,1,0,0,2
2,Moises Alou,7,7500000,0,7,15,54,49,17,15,2,0,0,19,2,2,0,2,1,0
3,Corey Patterson,9,3000000,0,8,123,392,366,75,46,17,2,10,126,16,16,0,1,4,5
4,Rod Barajas,10,700000,0,2,100,377,349,87,53,23,0,11,143,17,17,0,7,4,0


## The [ ] is completely different for DataFrames than for Series
The primary use of brackets is to retrieve one or more columns from a DataFrame. Simply write the name of the column inside the brackets and out pops a Series.

In [79]:
# Get the games column
games = df['Games']
games.head()

0     85
1     54
2     15
3    123
4    100
Name: Games, dtype: int64

In [81]:
# What fraction of players played more than 100 games?
(games > 100).sum() / games.size

0.62314540059347179

In [84]:
# Are there any players with 0 games played
(games == 0).any()

False

In [88]:
# Take the bottom quartile of games
games_low = games.sort_values().iloc[:games.size//4]
games_low.size

84

In [89]:
games.max()

163

## Retrieve multiple columns of a dataframe
Inside the brackets, insert a list of the columns you want.

In [91]:
df[['PA', 'AB', 'HR']].head()

Unnamed: 0,PA,AB,HR
0,288,245,6
1,128,120,3
2,54,49,0
3,392,366,10
4,377,349,11


In [92]:
# Get a one column data frame and not a series
# Just put the column name in a one item list
df[['H']].head()

Unnamed: 0,H
0,58
1,35
2,17
3,75
4,87
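
The difference between a bare column name and a one-item list is easy to verify. The sketch below uses a made-up frame (`stats`, `as_series` and `as_frame` are hypothetical names) and prints the type and shape of each result.

```python
import pandas as pd

stats = pd.DataFrame({'H': [58, 35, 17], 'HR': [6, 3, 0]})

as_series = stats['H']    # a bare name returns a one-dimensional Series
as_frame = stats[['H']]   # a list of names returns a two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```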


In [94]:
# Get only columns that have length of 2. 
# could come in handy if you have lots of columns that you don't want to write out
df[[col for col in df.columns if len(col) == 2]].head()

Unnamed: 0,PA,AB,1B,2B,3B,HR,BB,SF,SH
0,288,245,40,12,0,6,38,1,3
1,128,120,29,3,0,3,6,0,2
2,54,49,15,2,0,0,2,1,0
3,392,366,46,17,2,10,16,4,5
4,377,349,53,23,0,11,17,4,0


## Row index lookup
Remember all that time we just spent looking up Series values by the index? It's still important here; in this example there just isn't an interesting index, only the default integer index starting at 0.

## Back to being confused [?]
Even though it looks as though brackets are going to be used just for column lookups, they can still slice the dataframe by the rows if given slice notation

In [100]:
# slice row 20 to 40 by 2
df[20:40:2]

Unnamed: 0,PLAYER,Record_ID#,SALARY,ROOKIE,POS,Games,PA,AB,H,1B,2B,3B,HR,Total Bases,BB,UBB,IBB,HBP,SF,SH
20,Eric Hinske,33,800000,0,9,133,432,381,94,52,21,1,20,177,47,43,4,3,1,0
22,Ivan Rodriguez,36,12379883,0,2,111,429,398,110,80,20,3,7,157,23,21,2,3,2,3
24,Freddy Sanchez,38,4150000,0,4,145,608,569,154,117,26,2,9,211,21,20,1,4,6,8
26,Damion Easley,40,950000,0,4,113,347,316,85,67,10,2,6,117,19,19,0,7,3,2
28,Paul Konerko,42,12000000,0,3,122,514,438,105,63,19,1,22,192,65,61,4,7,4,0
30,Gerald Laird,48,1600000,0,2,95,381,344,95,65,24,0,6,137,23,21,2,6,4,4
32,Nick Swisher,59,3600000,0,3,151,588,497,109,63,21,1,24,204,82,76,6,4,4,1
34,Rob Bowen,63,410000,0,2,37,98,91,16,9,5,1,1,26,4,4,0,1,0,2
36,Akinori Iwamura,73,2400000,0,4,152,707,627,172,127,30,9,6,238,70,67,3,4,3,3
38,Luke Scott,82,430000,0,7,148,536,475,122,68,29,2,23,224,53,43,10,5,3,0


In [101]:
# Even though the above works, I always feel more comfortable being explicit and using .iloc
df.iloc[20:40:5]

Unnamed: 0,PLAYER,Record_ID#,SALARY,ROOKIE,POS,Games,PA,AB,H,1B,2B,3B,HR,Total Bases,BB,UBB,IBB,HBP,SF,SH
20,Eric Hinske,33,800000,0,9,133,432,381,94,52,21,1,20,177,47,43,4,3,1,0
25,Raul Ibanez,39,5500000,0,7,162,707,635,186,117,43,3,23,304,64,53,11,3,5,0
30,Gerald Laird,48,1600000,0,2,95,381,344,95,65,24,0,6,137,23,21,2,6,4,4
35,Ryan Howard,71,10000000,0,3,162,700,610,153,75,26,4,48,331,81,64,17,3,6,0


In [104]:
# What if you pass a single value
df.loc[40] # loc can take an integer here since the index is all integers. A series is returned

PLAYER         Matt Kemp
Record_ID#            91
SALARY            406000
ROOKIE                 0
POS                    8
Games                154
PA                   657
AB                   606
H                    176
1B                   115
2B                    38
3B                     5
HR                    18
Total Bases          278
BB                    46
UBB                   40
IBB                    6
HBP                    1
SF                     3
SH                     1
Name: 40, dtype: object

In [105]:
df.loc[[40]] #returns a dataframe when passed a list

Unnamed: 0,PLAYER,Record_ID#,SALARY,ROOKIE,POS,Games,PA,AB,H,1B,2B,3B,HR,Total Bases,BB,UBB,IBB,HBP,SF,SH
40,Matt Kemp,91,406000,0,8,154,657,606,176,115,38,5,18,278,46,40,6,1,3,1
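
With the default 0-based index, `.loc` and `.iloc` happen to agree, which can hide the difference between them. The sketch below uses a made-up frame whose integer labels run in reverse (all names are hypothetical) to show that `.iloc` works by position and `.loc` by label, and that label slices include the stop value.

```python
import pandas as pd

# Integer labels deliberately in reverse order
frame = pd.DataFrame({'val': [10, 20, 30, 40]}, index=[3, 2, 1, 0])

by_position = frame.iloc[0]   # first row by position (label 3)
by_label = frame.loc[0]       # row whose label is 0 (the last row)

pos_slice = frame.iloc[0:2]   # positions 0 and 1, stop is excluded
label_slice = frame.loc[3:1]  # labels 3, 2 and 1, stop is included

print(by_position['val'], by_label['val'])
print(len(pos_slice), list(label_slice.index))
```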


## Lets make the index a bit more interesting
Right now the row index does not give us any extra information. Let's go ahead and assign the PLAYER column as the index, which will remove it from the columns.

In [109]:
# Before we assign PLAYER as an index, lets see if it is unique
# Some new notation here: can use dot notation to get column
# I rarely ever do because it doesn't work with columns with spaces
# But here it is for completeness
df.PLAYER.unique().size, df.shape[0], df.PLAYER.value_counts().max() # the two numbers match so they must be unique
# The max count is one which also confirms uniqueness

(337, 337, 1)

In [119]:
# Set the index
df2 = df.set_index('PLAYER')

In [123]:
df2.loc[['Matt Kemp']]

PLAYER,Record_ID#,SALARY,ROOKIE,POS,Games,PA,AB,H,1B,2B,3B,HR,Total Bases,BB,UBB,IBB,HBP,SF,SH
Matt Kemp,91,406000,0,8,154,657,606,176,115,38,5,18,278,46,40,6,1,3,1


In [124]:
# which is easier than
df[df['PLAYER'] == 'Matt Kemp']

Unnamed: 0,PLAYER,Record_ID#,SALARY,ROOKIE,POS,Games,PA,AB,H,1B,2B,3B,HR,Total Bases,BB,UBB,IBB,HBP,SF,SH
40,Matt Kemp,91,406000,0,8,154,657,606,176,115,38,5,18,278,46,40,6,1,3,1


In [127]:
%timeit df2.loc['Matt Kemp']

10000 loops, best of 3: 78.8 µs per loop


In [128]:
%timeit df[df['PLAYER'] == 'Matt Kemp']

1000 loops, best of 3: 486 µs per loop


In [132]:
# Slice by index. Typical case is with timestamps
df2['Matt Kemp':'Pablo Ozuna']

PLAYER,Record_ID#,SALARY,ROOKIE,POS,Games,PA,AB,H,1B,2B,3B,HR,Total Bases,BB,UBB,IBB,HBP,SF,SH
Matt Kemp,91,406000,0,8,154,657,606,176,115,38,5,18,278,46,40,6,1,3,1
Andre Ethier,101,424500,0,9,140,596,525,160,97,38,5,20,268,59,59,0,4,7,1
Alex Gordon,121,406000,0,5,134,571,493,128,76,35,1,16,213,66,61,5,6,5,1
Morgan Ensberg,122,1750000,0,5,27,80,74,15,14,0,0,1,18,6,6,0,0,0,0
Mark DeRosa,123,4750000,0,4,149,593,505,144,90,30,3,21,243,69,69,0,9,8,2
Andruw Jones,125,14726910,0,8,69,238,209,33,21,8,1,3,52,27,27,0,1,1,0
Jim Edmonds,127,8000000,0,8,109,401,340,80,39,19,2,20,163,55,52,3,2,3,1
Albert Pujols,128,13870949,0,3,147,641,524,187,106,44,0,37,342,104,70,34,5,8,0
Luis Gonzalez,129,2000000,0,7,135,387,341,89,54,26,1,8,141,41,40,1,0,5,0
Brian Schneider,132,4900000,0,2,105,384,335,86,67,10,0,9,123,42,33,9,1,2,4


## Completeness .iat and .at
Remembering all the way back to this morning, .iat and .at are analogous to .iloc and .loc except that they can only get one specific element (a scalar), by integer position or by label respectively.

In [142]:
# iat looks up by integer position, at looks up by label
df.iat[99,8], df2.at['Pedro Feliz', 'HR']

(115, 14)
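
Both accessors can also be used to set a single value. A small sketch on made-up data (the player names and numbers are invented):

```python
import pandas as pd

stats = pd.DataFrame({'H': [58, 35], 'HR': [6, 3]},
                     index=['Player A', 'Player B'])

by_label = stats.at['Player B', 'HR']   # row label, column label
by_position = stats.iat[1, 1]           # row position, column position

# Scalar assignment works the same way
stats.at['Player B', 'HR'] = 4
print(by_label, by_position, stats.at['Player B', 'HR'])
```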

## Boolean slicing 2: The sequel is better this time
Just like we used boolean Series and arrays to slice Series last time, we shall use the same syntax here to slice DataFrames.

In [143]:
df['ROOKIE'].value_counts()

0    330
1      7
Name: ROOKIE, dtype: int64

In [145]:
## Get only the Rookies:
criteria = df['ROOKIE'] == 1
rookies = df[criteria]
veterans = df[~criteria] # note the negation operator
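
Boolean criteria can also be combined. The sketch below, on a made-up frame with hypothetical names, uses `&` for "and" and `|` for "or". Each comparison is stored in its own variable here; if you write them inline, wrap each comparison in parentheses.

```python
import pandas as pd

players = pd.DataFrame({'ROOKIE': [0, 1, 0, 1], 'HR': [25, 30, 5, 2]})

is_rookie = players['ROOKIE'] == 1
big_hitter = players['HR'] > 20

rookie_sluggers = players[is_rookie & big_hitter]  # both conditions hold
either = players[is_rookie | big_hitter]           # at least one holds

print(rookie_sluggers)
print(either)
```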

## Compare two dataframes with two different groups
The above commands create two distinct groups: rookies (those with no previous experience in the major leagues) and veterans (those with at least one year of experience). Let's calculate some metrics for the two groups.

In [149]:
# veterans have a huge salary advantage
veterans['SALARY'].mean(), rookies['SALARY'].mean()

(4410987.681818182, 1447095.142857143)

In [150]:
# How about number of home runs (HR)
veterans['HR'].mean(), rookies['HR'].mean() # surprisingly similar number of home runs

(11.781818181818181, 10.285714285714286)

In [166]:
# And how about number of bases on balls (BB)
bb_r = rookies['BB'].mean()
bb_v = veterans['BB'].mean()
'Veterans averaged {:.1f} BB and Rookies averaged {:.1f} BB'.format(bb_v, bb_r)

'Veterans averaged 38.0 BB and Rookies averaged 34.6 BB'

## Make new metrics by adding columns
DataFrames are not meant to be static objects and can be mutated at will. Let's begin by adding a column titled 'SLG' for slugging percentage, which is equal to total bases divided by at bats (TB / AB).

In [179]:
# create new column. Also, tab complete works inside the brackets when writing columns
df['SLG'] = (df['Total Bases'] / df['AB']).round(3)
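
The same pattern works for any column arithmetic. As a sketch on made-up numbers, the block below computes a batting average (hits divided by at bats) and then uses `assign`, which returns a new frame instead of modifying the existing one. The names `stats`, `AVG_PCT` and `with_pct` are hypothetical.

```python
import pandas as pd

stats = pd.DataFrame({'H': [58, 35], 'AB': [245, 120]})

# Bracket assignment adds the column to the existing frame
stats['AVG'] = (stats['H'] / stats['AB']).round(3)

# assign returns a new frame and leaves stats as it was
with_pct = stats.assign(AVG_PCT=stats['AVG'] * 100)

print(stats)
print(with_pct)
```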

In [181]:
# Slugging percentage is a standard metric for assessing player skill
df.sort_values('SLG', ascending=False).head()

Unnamed: 0,PLAYER,Record_ID#,SALARY,ROOKIE,POS,Games,PA,AB,H,1B,...,3B,HR,Total Bases,BB,UBB,IBB,HBP,SF,SH,SLG
47,Albert Pujols,128,13870949,0,3,147,641,524,187,106,...,0,37,342,104,70,34,5,8,0,0.653
23,Manny Ramirez,37,18929923,0,7,153,654,552,183,109,...,1,37,332,87,63,24,11,4,0,0.601
220,Ryan Ludwick,646,411000,0,9,150,617,538,161,81,...,3,37,318,62,59,3,8,8,1,0.591
102,Mike Napoli,309,425000,0,2,78,274,227,62,32,...,1,20,133,35,30,5,5,6,1,0.586
244,Chipper Jones,739,12333333,0,5,128,534,439,160,113,...,1,22,252,90,74,16,1,4,0,0.574


## What if you want to insert a column somewhere in the middle and not the end?
pandas supports this with the insert method, which takes the integer position of the new column, the column name, and the column values.

In [184]:
# Lets create a new column that is the log of the salary
insert_num = df.columns.get_loc('SALARY') + 1
log_salary = np.log(df['SALARY'])
df.insert(insert_num, 'LOG SALARY', log_salary)

In [185]:
df.head()

Unnamed: 0,PLAYER,Record_ID#,SALARY,LOG SALARY,ROOKIE,POS,Games,PA,AB,H,...,3B,HR,Total Bases,BB,UBB,IBB,HBP,SF,SH,SLG
0,Gregg Zaun,1,3750000,15.137266,0,2,85,288,245,58,...,0,6,88,38,37,1,1,1,3,0.359
1,Henry Blanco,2,3175000,14.970818,0,2,54,128,120,35,...,0,3,47,6,5,1,0,0,2,0.392
2,Moises Alou,7,7500000,15.830414,0,7,15,54,49,17,...,0,0,19,2,2,0,2,1,0,0.388
3,Corey Patterson,9,3000000,14.914123,0,8,123,392,366,75,...,2,10,126,16,16,0,1,4,5,0.344
4,Rod Barajas,10,700000,13.458836,0,2,100,377,349,87,...,0,11,143,17,17,0,7,4,0,0.41


In [187]:
## I don't like how I forgot to round the log salary. Let me drop it and start again
df.drop('LOG SALARY', axis=1, inplace=True) # remember axis of 1 is for columns

In [188]:
#one more time
insert_num = df.columns.get_loc('SALARY') + 1
log_salary = np.log(df['SALARY']).round(1)
df.insert(insert_num, 'LOG SALARY', log_salary)

In [190]:
#much better
df.head()

Unnamed: 0,PLAYER,Record_ID#,SALARY,LOG SALARY,ROOKIE,POS,Games,PA,AB,H,...,3B,HR,Total Bases,BB,UBB,IBB,HBP,SF,SH,SLG
0,Gregg Zaun,1,3750000,15.1,0,2,85,288,245,58,...,0,6,88,38,37,1,1,1,3,0.359
1,Henry Blanco,2,3175000,15.0,0,2,54,128,120,35,...,0,3,47,6,5,1,0,0,2,0.392
2,Moises Alou,7,7500000,15.8,0,7,15,54,49,17,...,0,0,19,2,2,0,2,1,0,0.388
3,Corey Patterson,9,3000000,14.9,0,8,123,392,366,75,...,2,10,126,16,16,0,1,4,5,0.344
4,Rod Barajas,10,700000,13.5,0,2,100,377,349,87,...,0,11,143,17,17,0,7,4,0,0.41


## Easy  way to add a constant as a column value

In [194]:
# add a constant. all rows have value of 5
df['constant'] = 5

In [195]:
# remove a column (del works on columns only; use drop to remove rows)
del df['constant']

In [196]:
#its gone
'constant' in df.columns

False

## Adding new rows?
In rare cases you might need to add individual rows to your dataframe

In [197]:
# Let's add it to df2, which has the player name as the index
df2.head()

PLAYER,Record_ID#,SALARY,ROOKIE,POS,Games,PA,AB,H,1B,2B,3B,HR,Total Bases,BB,UBB,IBB,HBP,SF,SH
Gregg Zaun,1,3750000,0,2,85,288,245,58,40,12,0,6,88,38,37,1,1,1,3
Henry Blanco,2,3175000,0,2,54,128,120,35,29,3,0,3,47,6,5,1,0,0,2
Moises Alou,7,7500000,0,7,15,54,49,17,15,2,0,0,19,2,2,0,2,1,0
Corey Patterson,9,3000000,0,8,123,392,366,75,46,17,2,10,126,16,16,0,1,4,5
Rod Barajas,10,700000,0,2,100,377,349,87,53,23,0,11,143,17,17,0,7,4,0


In [198]:
#Lets steal a row to get started
new_row = df2.loc['Moises Alou'].copy()  # copy so the edits below do not touch the original row

In [199]:
new_row

Record_ID#           7
SALARY         7500000
ROOKIE               0
POS                  7
Games               15
PA                  54
AB                  49
H                   17
1B                  15
2B                   2
3B                   0
HR                   0
Total Bases         19
BB                   2
UBB                  2
IBB                  0
HBP                  2
SF                   1
SH                   0
Name: Moises Alou, dtype: int64

In [200]:
#lets make some changes
new_row['HR'] = 20 # pretty impressive for only 49 atbats

In [201]:
# Now add to the dataframe
df2.loc['Ted Petrou'] = new_row

In [204]:
# There I am making 7.5 million!!!
df2.iloc[-1:-5:-1]

PLAYER,Record_ID#,SALARY,ROOKIE,POS,Games,PA,AB,H,1B,2B,3B,HR,Total Bases,BB,UBB,IBB,HBP,SF,SH
Ted Petrou,7,7500000,0,7,15,54,49,17,15,2,0,20,19,2,2,0,2,1,0
Reggie Willits,947,432500,0,7,64,136,108,21,17,4,0,0,25,21,21,0,0,2,5
Carlos Quentin,944,400000,0,7,130,569,480,138,75,26,1,36,274,66,66,0,20,3,0
Nick Markakis,942,455000,0,9,157,697,595,182,113,48,1,20,292,99,92,7,2,1,0


## What the MultiIndex are you talking about?
It is possible to assign multiple levels of indexing to both the rows and the columns.

In [217]:
# Assign two columns as an index. The outer column is level 0 and the inner column is level 1
df3 = df.set_index(['PLAYER', 'Record_ID#'])

In [218]:
# you can query the inner column like this
df3.xs(1, level=1)

PLAYER,SALARY,LOG SALARY,ROOKIE,POS,Games,PA,AB,H,1B,2B,3B,HR,Total Bases,BB,UBB,IBB,HBP,SF,SH,SLG
Gregg Zaun,3750000,15.1,0,2,85,288,245,58,40,12,0,6,88,38,37,1,1,1,3,0.359


In [221]:
# and the outer column like this
df3.xs('Nick Markakis', drop_level=False) # keep the original index level

PLAYER,Record_ID#,SALARY,LOG SALARY,ROOKIE,POS,Games,PA,AB,H,1B,2B,3B,HR,Total Bases,BB,UBB,IBB,HBP,SF,SH,SLG
Nick Markakis,942,455000,13.0,0,9,157,697,595,182,113,48,1,20,292,99,92,7,2,1,0,0.491


In [223]:
# getting fancy with MultiIndexing
df3.loc[('Gregg Zaun', 1), ['ROOKIE','H']]

ROOKIE     0.0
H         58.0
Name: (Gregg Zaun, 1), dtype: float64
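
Because every player in the baseball data has a unique ID, the second index level does not group anything there. The sketch below uses a made-up frame in which the outer level repeats (the team and player labels are invented), which may make the behavior of `xs` and tuple lookups easier to see.

```python
import pandas as pd

stats = pd.DataFrame({'team': ['A', 'A', 'B'],
                      'player': ['x', 'y', 'x'],
                      'HR': [5, 7, 9]})
indexed = stats.set_index(['team', 'player'])

# Levels can be referred to by name as well as by number
team_a = indexed.xs('A', level='team')       # all rows for team A
players_x = indexed.xs('x', level='player')  # player x on every team

# A tuple selects one row; adding a column label selects one cell
one_cell = indexed.loc[('B', 'x'), 'HR']

print(team_a)
print(players_x)
print(one_cell)
```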

## Apply summary statistics to entire frame
As seen with df.describe(), the core summary statistics of the dataframe are returned by one method. We can similarly compute each statistic individually.

In [226]:
# One very nice property of this dataset is that once the player is in the index, the rest of the columns are
# numeric and will return back a nice result. Asking pandas to sum up a string can break things fast
df3.sum()

SALARY         1.465756e+09
LOG SALARY     4.916100e+03
ROOKIE         7.000000e+00
POS            1.930000e+03
Games          3.693900e+04
PA             1.417280e+05
AB             1.259160e+05
H              3.411200e+04
1B             2.249700e+04
2B             7.023000e+03
3B             6.320000e+02
HR             3.960000e+03
Total Bases    5.427900e+04
BB             1.277400e+04
UBB            1.165700e+04
IBB            1.117000e+03
HBP            1.327000e+03
SF             1.099000e+03
SH             6.010000e+02
SLG            1.371700e+02
dtype: float64

In [227]:
df3.max()

SALARY         2.800000e+07
LOG SALARY     1.710000e+01
ROOKIE         1.000000e+00
POS            1.100000e+01
Games          1.630000e+02
PA             7.630000e+02
AB             6.880000e+02
H              2.130000e+02
1B             1.800000e+02
2B             5.400000e+01
3B             1.900000e+01
HR             4.800000e+01
Total Bases    3.420000e+02
BB             1.220000e+02
UBB            1.090000e+02
IBB            3.400000e+01
HBP            2.700000e+01
SF             1.100000e+01
SH             1.500000e+01
SLG            6.530000e-01
dtype: float64

## What if I wanted to find out which player was best in each of these categories?
Let's find the index labels of these maxes.

In [242]:
# This is really nice! Since we put the player name in the index
df3.idxmax().iloc[0]  # first entry: the index label of the top SALARY

('Alex Rodriguez', 766)
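
To make the return value of `idxmax` concrete, here is a sketch on a made-up frame with invented players and a hypothetical stolen-base column `SB`. The result is a Series holding one index label per column, which can then be passed to `.loc` to pull out the matching rows.

```python
import pandas as pd

stats = pd.DataFrame({'HR': [6, 48, 16], 'SB': [1, 2, 78]},
                     index=['Player A', 'Player B', 'Player C'])

# One label per column: the row where that column reaches its maximum
leaders = stats.idxmax()
print(leaders)

# Look up the full rows for every player who leads at least one column
leader_rows = stats.loc[leaders.unique()]
print(leader_rows)
```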

In [247]:
#now if I wanted to return the rows of the df with these players
df_maxes = df3.loc[df3.idxmax().unique()]

In [248]:
# new feature in pandas
df_maxes.style.highlight_max()

PLAYER,Record_ID#,SALARY,LOG SALARY,ROOKIE,POS,Games,PA,AB,H,1B,2B,3B,HR,Total Bases,BB,UBB,IBB,HBP,SF,SH,SLG
Alex Rodriguez,766,28000000.0,17.1,0,5,138,594,510,154,86,33,0,35,292,65,56,9,14,5,0,0.573
Geovany Soto,205,401000.0,12.9,1,2,140,563,494,141,81,35,2,23,249,62,56,6,2,5,0,0.504
Marlon Anderson,16,1050000.0,13.9,0,11,87,151,138,29,22,6,0,1,38,9,9,0,0,2,2,0.275
Justin Morneau,906,8400000.0,15.9,0,3,163,712,623,187,113,47,4,23,311,76,60,16,3,10,0,0.499
Jose Reyes,790,4375000.0,15.3,0,6,159,763,688,204,132,37,19,16,327,66,58,8,1,3,5,0.475
Dustin Pedroia,591,457000.0,13.0,0,4,157,726,653,213,140,54,2,17,322,50,49,1,7,9,7,0.493
Ichiro Suzuki,763,17102100.0,16.7,0,9,162,749,686,213,180,20,7,6,265,51,39,12,5,4,3,0.386
Ryan Howard,71,10000000.0,16.1,0,3,162,700,610,153,75,26,4,48,331,81,64,17,3,6,0,0.543
Albert Pujols,128,13870900.0,16.4,0,3,147,641,524,187,106,44,0,37,342,104,70,34,5,8,0,0.653
Adam Dunn,21,13000000.0,16.4,0,7,158,651,517,122,59,23,0,40,265,122,109,13,7,5,0,0.513


## Summarize by row instead of by column
The baseball data is not ideal for running statistics by row, since each column represents something completely different, but there will be plenty of times when you do want summary statistics per row. This is where we finally get to use the axis parameter in all these statistical methods.

In [251]:
# Sum up every column for every player. The number is meaningless but shows that these stats
# can be computed very easily by just summing up across rows
df3.sum(axis=1).head()

PLAYER           Record_ID#
Gregg Zaun       1             3750920.459
Henry Blanco     2             3175450.392
Moises Alou      7             7500201.188
Corey Patterson  9             3001222.244
Rod Barajas      10             701203.910
dtype: float64

# Your Turn!
Once again, it's time to practice the skills we just covered. This should be a tad more exciting than the last problem set.

## Problem 1
<span  style="color:green; font-size:16px">Create a dataframe by first creating 2 separate Series objects that contain 5 elements each and have some, but not all, of their index values in common. Pass a dictionary whose keys are strings and whose values are the Series to the DataFrame constructor. What concept does this show that keeps getting reinforced?</span>

In [252]:
# your code here

## Problem 2
<span  style="color:green; font-size:16px">You should have some missing values in the dataframe above if you did problem 1 correctly. Use the count method twice, once for each axis (0 or 1). What is this method doing? See if you can get the same numbers by using the sum and the isnull/notnull methods</span>

In [263]:
# your code here

## Problem 3
<span  style="color:green; font-size:16px">Continuing with the DataFrame you created in problem one, check out the fillna method to fill the missing values using the argument "method".</span>

In [271]:
# your code here

## Problem 4
<span  style="color:green; font-size:16px">Re-read the baseball.csv file but this time, inspect the read_csv arguments to do the following: assign the index column PLAYER on read, skip the first 20 rows, keep the header. Once read, rename the 'BB' column to 'walks'. You will be working with this dataset the rest of the problems</span>

In [280]:
# your code here

## Problem 5
<span  style="color:green; font-size:16px">Look up the formula for <a href="https://en.wikipedia.org/wiki/On-base_percentage">OBP</a>, calculate it, make it a new column, and sort the entire frame based on OBP. Then find the average salaries of the top and bottom 10 players and write them out with 2 decimal places in a nice print statement that uses string interpolation via .format</span>

In [281]:
# your code here

## Problem 6
<span  style="color:green; font-size:16px">Drop a column and then drop a row. Make sure you use the axis argument correctly</span>

In [283]:
# your code here

## Problem 7
<span  style="color:green; font-size:16px">Mean normalize the homeruns (HR) column by subtracting the homerun mean and dividing by the standard deviation. Use similar logic as the previous section to determine if homeruns follow a normal distribution </span>

In [284]:
# your code here

## Problem 8
<span  style="color:green; font-size:16px">Split the data into three groups via boolean slicing - those with under 50 games (G), between 50 and 100, and over 100 games. Find the average home runs and salary for each of the groups</span>

In [285]:
# your code here

## Problem 9
<span  style="color:green; font-size:16px">Is there any player with a salary under $2,000,000 with more than 20 Home runs and less than 600 plate appearances (PA)?</span>

In [286]:
# your code here

## Problem 10
<span  style="color:green; font-size:16px">Which player has the lowest number of hits (H) given at least 500 at bats (AB)</span>

In [287]:
# your code here

## Problem 11
<span  style="color:green; font-size:16px">Insert a new column in the 5th index with a random number from numpy</span>

In [288]:
# your code here

## Problem 12
<span  style="color:green; font-size:16px">Do some calculation to get a 100 x 10 dataframe retrieved from the dead center of the </span>

In [289]:
# your code here

## Problem 13
<span  style="color:green; font-size:16px">In this problem you will research the dataframe method append. First create two dataframes. The first will be the top 5 rows and second will be the last 5 rows. Drop one column(not the same column) from each dataframe. Use the dataframe append method to stack the frames one on top of each other. What is the result?</span>

In [290]:
# your code here

## Problem 14
<span  style="color:green; font-size:16px">In this problem we will be researching the pandas concat function (pd.concat). Create two dataframes. The first will consist of the first 2 columns of the baseball dataset and the other will consist of the last 2 columns. Drop several rows(with different indexes) from each dataset using del and iloc. Concatenate them together using pd.concat. Describe what happened</span>

In [291]:
# your code here