# Pandas Dataframe
* A table similar to that in Excel or SQL.
* Unlike a numpy 2D array, each column can have its own datatype.
* Terminology: a 2D structure in numpy is a matrix (2D array); in Pandas it's a dataframe.
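
The per-column datatype point can be sketched with a tiny made-up table (the names `df`, `column_kinds` and `arr`, and the values, are hypothetical and not taken from the Titanic data). The numpy array is given `dtype=object` explicitly so that it can hold mixed values at all; it still has a single dtype for every element.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row table: text, integer and float columns
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],
    "pclass": [1, 3, 2],
    "fare": [71.28, 7.25, 13.0],
})

# Each column carries its own dtype; .kind is a one-letter code for it
column_kinds = {col: df[col].dtype.kind for col in df.columns}
print(column_kinds)

# A numpy array has exactly one dtype shared by all of its elements
arr = np.array([["Ann", 1, 71.28], ["Bob", 3, 7.25]], dtype=object)
print(arr.dtype)
```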

This notebook will demonstrate a few of the most common operations that can be performed on a Dataframe.

[Pandas Dataframe Documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)

In [1]:
import pandas as pd
print("Pandas Version {}".format(pd.__version__))
import numpy as np
print("Numpy Version {}".format(np.__version__))

Pandas Version 0.22.0
Numpy Version 1.14.1


### Use Titanic Dataset
See [codebook](https://www.kaggle.com/c/titanic/data) (aka Data Dictionary)

In [2]:
# read in all the labeled data
all_data = pd.read_csv('../data/train.csv')
all_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [68]:
# although it does not matter here, some Pandas operations will
# run more quickly if the dataframe is sorted on its index
all_data.sort_index(inplace=True)
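
Whether an index is already sorted can be checked with `Index.is_monotonic_increasing`. The sketch below uses a small hypothetical frame (`df`, with a deliberately shuffled index) rather than the Titanic data, and uses the non-inplace form of `sort_index`, which returns a new dataframe.

```python
import pandas as pd

# Hypothetical frame whose index is deliberately out of order
df = pd.DataFrame({"fare": [7.25, 71.28, 7.92]}, index=[2, 0, 1])

before = df.index.is_monotonic_increasing   # index is 2, 0, 1

# sort_index returns a sorted copy unless inplace=True is passed
sorted_df = df.sort_index()
after = sorted_df.index.is_monotonic_increasing

print(before, after)
```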

In [15]:
# rows and columns
n_rows, n_cols = all_data.shape
print((n_rows, n_cols))

(891, 12)


In [8]:
# len gives number of rows
len(all_data)

891

In [12]:
# the value to be predicted is a column vector, commonly denoted as y
y = all_data['Survived']
type(y)

pandas.core.series.Series

In [20]:
# A column vector with 891 rows
y.shape

(891,)

In [13]:
y.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

In [14]:
y.value_counts()

0    549
1    342
Name: Survived, dtype: int64

In [18]:
# as per the codebook (i.e. Data Dictionary) 1 means survived
# note: the result is a fraction between 0 and 1, not a percentage
percent_survived = y.eq(1).sum() / n_rows
percent_survived

0.3838383838383838
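
Because the label is coded as 0 and 1, the same fraction can also be obtained from the mean of the column, or from `value_counts(normalize=True)`. A minimal sketch on a hypothetical five-row label series (not the real `y`):

```python
import pandas as pd

# Hypothetical 0/1 labels; 3 of the 5 are 1
y_small = pd.Series([0, 1, 1, 1, 0], name="Survived")

# Count-based fraction, as in the cell above
fraction_by_count = y_small.eq(1).sum() / len(y_small)

# For a 0/1 column the mean is the fraction of ones
fraction_by_mean = y_small.mean()

# normalize=True turns the counts into fractions, indexed by label
fractions = y_small.value_counts(normalize=True)

print(fraction_by_count, fraction_by_mean, fractions[1])
```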

In [19]:
# the 2D dataframe to be used as input is commonly called X
# and is every column other than the column to be predicted
X = all_data.drop('Survived', axis=1)
X.shape

(891, 11)

In [22]:
# Examine datatypes as inferred by read_csv
X.dtypes

PassengerId      int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [23]:
# object usually refers to text data, but is Sex really text?
X['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [28]:
# Sex is a categorical value, correct its datatype
gender = X['Sex'].astype('category')
gender.dtype

CategoricalDtype(categories=['female', 'male'], ordered=False)

In [31]:
# modify dataframe
X['Sex'] = gender
X['Sex'].head()

0      male
1    female
2    female
3    female
4      male
Name: Sex, dtype: category
Categories (2, object): [female, male]
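
A categorical Series stores each distinct label once and represents the rows as small integer codes, which the `.cat` accessor exposes. The sketch below uses a hypothetical four-row series (`sex_small`) rather than the real column.

```python
import pandas as pd

# Hypothetical text column converted to a categorical
sex_small = pd.Series(["male", "female", "female", "male"]).astype("category")

# The distinct labels, sorted, stored once
categories = list(sex_small.cat.categories)

# One integer code per row, indexing into the categories
codes = sex_small.cat.codes.tolist()

print(categories)
print(codes)
```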

In [32]:
# Pclass refers to 1st, 2nd or 3rd class, is it really an integer?
X['Pclass'].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [38]:
# Pclass is an ordered categorical variable
pclass = pd.Categorical(X['Pclass'], ordered=True, categories=[1, 2, 3])
pclass

[3, 1, 3, 1, 3, ..., 2, 1, 3, 1, 3]
Length: 891
Categories (3, int64): [1 < 2 < 3]
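
Marking a categorical as ordered is what makes comparisons and `min`/`max` meaningful. A minimal sketch on a hypothetical four-row series (`pclass_small`), assuming the same category order `1 < 2 < 3` as above:

```python
import pandas as pd

# Hypothetical ordered categorical with the order 1 < 2 < 3
pclass_small = pd.Series(
    pd.Categorical([3, 1, 3, 2], categories=[1, 2, 3], ordered=True)
)

# min and max follow the declared category order
lowest = pclass_small.min()
highest = pclass_small.max()

# Comparisons against a value that is one of the categories are allowed
above_first = (pclass_small > 1).tolist()

print(lowest, highest, above_first)
```

On an unordered categorical the same comparison raises a `TypeError`.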

In [39]:
X['Pclass'] = pclass

In [40]:
X.dtypes

PassengerId       int64
Pclass         category
Name             object
Sex            category
Age             float64
SibSp             int64
Parch             int64
Ticket           object
Fare            float64
Cabin            object
Embarked         object
dtype: object

In [41]:
X['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [45]:
# Embarked is an unordered categorical variable
X['Embarked'] = pd.Categorical(X['Embarked'])
X['Embarked'].dtype

CategoricalDtype(categories=['C', 'Q', 'S'], ordered=False)
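
Note that the counts above add up to 889, not 891, so two rows have no Embarked value. Missing values do not become a category; they are kept as missing and given the code `-1`. A sketch on a hypothetical four-row series (`embarked_small`):

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
embarked_small = pd.Series(pd.Categorical(["S", "C", np.nan, "Q"]))

# NaN is not listed among the categories
categories = list(embarked_small.cat.categories)

# The missing row gets the sentinel code -1
codes = embarked_small.cat.codes.tolist()

n_missing = int(embarked_small.isna().sum())

print(categories, codes, n_missing)
```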

In [46]:
X.dtypes

PassengerId       int64
Pclass         category
Name             object
Sex            category
Age             float64
SibSp             int64
Parch             int64
Ticket           object
Fare            float64
Cabin            object
Embarked       category
dtype: object

In [54]:
# Name, Ticket and Cabin are text (i.e. str) so no need to convert
# Note, a Series dtype for str is represented as 'O' for object
X['Cabin'].dtype

dtype('O')

In [62]:
# Age really is a float
X['Age'].unique()

array([22.  , 38.  , 26.  , 35.  ,   nan, 54.  ,  2.  , 27.  , 14.  ,
        4.  , 58.  , 20.  , 39.  , 55.  , 31.  , 34.  , 15.  , 28.  ,
        8.  , 19.  , 40.  , 66.  , 42.  , 21.  , 18.  ,  3.  ,  7.  ,
       49.  , 29.  , 65.  , 28.5 ,  5.  , 11.  , 45.  , 17.  , 32.  ,
       16.  , 25.  ,  0.83, 30.  , 33.  , 23.  , 24.  , 46.  , 59.  ,
       71.  , 37.  , 47.  , 14.5 , 70.5 , 32.5 , 12.  ,  9.  , 36.5 ,
       51.  , 55.5 , 40.5 , 44.  ,  1.  , 61.  , 56.  , 50.  , 36.  ,
       45.5 , 20.5 , 62.  , 41.  , 52.  , 63.  , 23.5 ,  0.92, 43.  ,
       60.  , 10.  , 64.  , 13.  , 48.  ,  0.75, 53.  , 57.  , 80.  ,
       70.  , 24.5 ,  6.  ,  0.67, 30.5 ,  0.42, 34.5 , 74.  ])

In [66]:
# But it should be noted that a Series that contains integers and a 
# missing value will be represented as a float
s = pd.Series(data=range(10))
print(s.dtype)

int64


In [67]:
# assign a missing value to the series
# now the series type is float
s[3] = np.nan
print(s.dtype)

float64
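
The conversion to float can be avoided with the nullable integer dtype `"Int64"` (capital I). This is an assumption about the environment: that dtype was added in pandas 0.24, so it is not available in the 0.22.0 version printed at the top of this notebook. The series names below are hypothetical.

```python
import pandas as pd

# Default behaviour: integers plus a missing value become float64
s_default = pd.Series([0, 1, 2, None])

# Nullable integer dtype (requires pandas 0.24 or newer)
s_nullable = pd.Series([0, 1, 2, None], dtype="Int64")

print(s_default.dtype)
print(s_nullable.dtype)
print(s_nullable.isna().sum())
```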


In [71]:
# the "data" for X is in X.values
type(X.values)

numpy.ndarray

In [72]:
X.values.shape

(891, 11)

In [76]:
X.columns

Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [77]:
X.index

RangeIndex(start=0, stop=891, step=1)

In [78]:
X.axes

[RangeIndex(start=0, stop=891, step=1),
 Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
        'Ticket', 'Fare', 'Cabin', 'Embarked'],
       dtype='object')]

### Data Science Ops

In [79]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    891 non-null int64
Pclass         891 non-null category
Name           891 non-null object
Sex            891 non-null category
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null category
dtypes: category(3), float64(2), int64(3), object(3)
memory usage: 58.7+ KB
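
The non-null counts reported by `info()` can be turned into missing-value counts per column with `isna().sum()`. A minimal sketch on a hypothetical three-row frame (`df_small`), not the real `X`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in two of its columns
df_small = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Fare": [7.25, 71.28, 7.92],
})

# isna() marks each missing cell; summing counts them per column
missing = df_small.isna().sum()

print(missing)
```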


In [75]:
X.values[0:2, :]

array([[1, 3, 'Braund, Mr. Owen Harris', 'male', 22.0, 1, 0, 'A/5 21171',
        7.25, nan, 'S'],
       [2, 1, 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
        'female', 38.0, 1, 0, 'PC 17599', 71.2833, 'C85', 'C']],
      dtype=object)
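
The `dtype=object` in the output above arises because a numpy array has a single dtype, so a dataframe with mixed column types falls back to `object`. Selecting only the numeric columns first yields a numeric array. A sketch on a hypothetical two-row frame (`df_small`):

```python
import pandas as pd

# Hypothetical frame mixing text, float and integer columns
df_small = pd.DataFrame({
    "Name": ["Ann", "Bob"],
    "Age": [22.0, 38.0],
    "SibSp": [1, 0],
})

# Mixed column types: the array falls back to object
mixed = df_small.values

# Numeric columns only: float and int share the common type float64
numeric = df_small.select_dtypes(include="number").values

print(mixed.dtype, numeric.dtype, numeric.shape)
```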