# INFO 212: Data Science Programming 1
___

### Week 4: Getting Started with pandas
___

### Mon., April 23, and Wed., April 25, 2018
---

**Question:**
- What capabilities does Python provide to make data cleaning and analysis fast and easy? 

**Objectives:**
- Distinguish pandas Series and DataFrame data structures.
- Apply the essential functionality of DataFrame including indexing, selection, filtering.
- Fill or drop missing values in DataFrame.
- Apply functions to DataFrame values by using map.
- Summarize and compute descriptive statistics.

Pandas will be a major tool of interest for data analysis. It
contains data structures and data manipulation tools designed to make data cleaning
and analysis fast and easy in Python. pandas is often used in tandem with numerical
computing tools like NumPy and SciPy, analytical libraries like statsmodels and
scikit-learn, and data visualization libraries like matplotlib. pandas adopts significant
parts of NumPy’s idiomatic style of array-based computing, especially array-based
functions and a preference for data processing without for loops.

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
```

```
np.random.seed(12345)
plt.rc('figure', figsize=(10, 6))
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.set_printoptions(precision=4, suppress=True)
```

## Load the Data Set We Will Work on

```
df = pd.read_csv("datasets/titanic/train.csv")
df.head()
```

A pandas DataFrame has a describe() method, which calculates descriptive statistics for the numeric columns of the data.

```
df.describe()
```

## Introduction to pandas Data Structures

### Series
A Series is a one-dimensional array-like object containing a sequence of values (of
similar types to NumPy types) and an associated array of data labels, called its index.

```
obj = pd.Series([4, 7, -5, 3])
obj
```

A Series has an index and values.

```
obj.values
```

```
obj.index
```

Each column or row in a DataFrame is a Series.

```
type(df['Age'])
```

A Series can be created from a dictionary.

```
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3
```

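The index of a Series can be set explicitly. Labels with no matching key in the dictionary get a missing value (NaN), and keys that are not in the index are left out.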

```
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4
```
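The label matching can be checked directly. The following is a minimal sketch that reuses the `sdata` and `states` values from above; the names `missing` and `has_utah` are introduced here only for illustration.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

# Build the Series from the dict, keeping only the labels listed in `states`
obj4 = pd.Series(sdata, index=states)

# 'California' has no value in sdata, so it is stored as NaN
# (which also makes the dtype float64)
missing = obj4.isna()

# 'Utah' is in sdata but not in `states`, so it is excluded from the result
has_utah = 'Utah' in obj4.index

print(obj4)
print(missing)
print(has_utah)
```

`isna` returns a boolean Series with the same index, so it can be used as a filter on the original object.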

The values of a Series can be reordered by passing a new index.

```
states=['Texas', 'Ohio', 'Oregon', 'Utah']
obj5 = pd.Series(sdata, index=states)
obj5
```

### DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection
of columns, each of which can be a different value type (numeric, string,
boolean, etc.). The DataFrame has both a row and column index; it can be thought of
as a dict of Series all sharing the same index. 

How to create a DataFrame object?

```
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame
```

```
frame.head()
```

As with Series, if you specify a sequence of columns, the DataFrame’s columns will be arranged in that order.

```
pd.DataFrame(data, columns=['year', 'state', 'pop'])
```

If you pass a column that isn’t contained in the dict, it will appear with missing values in the result.

```
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four',
                             'five', 'six'])
frame2
frame2.columns
```

```
frame2
```

A column in a DataFrame can be retrieved as a Series either by dict-like notation or
by attribute.

```
frame2['state']
```

```
frame2.year
```

How to retrieve rows of a DataFrame?

```
frame2.loc['three']
```

```
frame2.iloc[2]
```

How to assign values to a column?

```
frame2['debt'] = [1, 2, 3] * 2
```

```
frame2
```

When you are assigning lists or arrays to a column, the value’s length must match the
length of the DataFrame. If you assign a Series, its labels will be realigned exactly to
the DataFrame’s index, inserting missing values in any holes.

```
debts = pd.Series([1, 2, 3], index = ['one', 'two', 'three'])
frame2['debt'] = debts
```

```
frame2
```
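The holes left by alignment can then be filled or dropped. The following is a minimal sketch on a smaller, hypothetical version of `frame2`; `fillna` and `dropna` are the standard pandas methods for this, and both return new objects.

```python
import pandas as pd

# Smaller stand-in for frame2 (values chosen for illustration only)
frame2 = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada', 'Nevada'],
                       'pop': [1.5, 1.7, 2.4, 2.9]},
                      index=['one', 'two', 'three', 'four'])

# Assigning a Series aligns on the index; 'three' and 'four' get NaN
frame2['debt'] = pd.Series([1.0, 2.0], index=['one', 'two'])

# fillna returns a new DataFrame with the NaN entries in 'debt' replaced by 0
filled = frame2.fillna({'debt': 0})

# dropna returns a new DataFrame without the rows that contain any NaN
dropped = frame2.dropna()

print(filled)
print(dropped)
```

Whether to fill or drop depends on the analysis; filling with 0 is only reasonable when a missing debt really means no debt.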

### Index Objects
pandas’s Index objects are responsible for holding the axis labels and other metadata
(like the axis name or names). Any array or other sequence of labels you use when
constructing a Series or DataFrame is internally converted to an Index.

```
frame2.index
```

```
frame2.columns
```

```
frame2.columns[2]
```

```
frame2.index.union(frame2.columns)
```

## Essential Functionality

### Reindexing
An important method on pandas objects is reindex, which means to create a new
object with the data conformed to a new index.

```
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
```

```
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
```

For ordered data like time series, it may be desirable to do some interpolation or filling
of values when reindexing. The method option allows us to do this, using a
method such as ffill, which forward-fills the values.

```
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3
```

```
obj3.reindex(range(6), method='ffill')
```

With DataFrame, reindex can alter either the (row) index, columns, or both. When
passed only a sequence, it reindexes the rows in the result.

```
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
frame
```

```
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2
```

```
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states, fill_value=0)
```

### Dropping Entries from an Axis
Dropping one or more entries from an axis is easy if you already have an index array
or list without those entries. As that can require a bit of munging and set logic,
the drop method will return a new object with the indicated value or values deleted from
an axis.

How to drop the PassengerId and Survived columns from the DataFrame?

```
df.drop(['PassengerId', 'Survived'], axis = 1)
```
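Because drop returns a new object, the original DataFrame is left unchanged. The following is a minimal sketch on a small made-up frame (the labels are illustrative, not part of the Titanic data) showing row and column drops side by side.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Drop rows by label (axis=0 is the default)
no_rows = data.drop(['Colorado', 'Ohio'])

# Drop columns by label; the columns= keyword is equivalent to axis=1
no_cols = data.drop(columns=['two', 'four'])

# The original keeps its shape because drop returns a new object
print(data.shape, no_rows.shape, no_cols.shape)
```

To keep the result, assign it to a name, for example `df = df.drop(...)`.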

### Indexing, Selection, and Filtering
For both Series and DataFrame, indices can be used for selection and filtering.

How to find the ages for passengers 1-10?

```
df.iloc[:10, :]['Age']
```

How to find all the passengers who purchased expensive tickets (Fare > 500)?

```
df[df.Fare > 500]
```

#### Selection with loc and iloc

How to retrieve the Age, Sex, and Pclass of passengers 10-20?

```
df.loc[10:20, ['Age', 'Sex', 'Pclass']]
```

```
df.columns.get_loc('Age')
```

```
df.iloc[:10, 4:6]
```
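One difference between the two accessors is easy to miss: a loc slice is label-based and includes its end point, while an iloc slice is position-based and excludes it. The following is a minimal sketch on a hypothetical stand-in frame `toy` with a default integer index.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic frame with a default integer index
toy = pd.DataFrame({'Age': np.arange(20, 50), 'Pclass': [1, 2, 3] * 10})

by_label = toy.loc[10:20, ['Age']]      # labels 10 through 20, end included
by_position = toy.iloc[10:20, [0]]      # positions 10 through 19, end excluded

print(len(by_label), len(by_position))
```

With a default index the labels and positions coincide, so the only visible difference is the extra row returned by loc.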

### Arithmetic and Data Alignment
An important pandas feature for some applications is the behavior of arithmetic
between objects with different indexes. When you are adding together objects, if any
index pairs are not the same, the respective index in the result will be the union of the index pairs. For users with database experience, this is similar to an automatic outer join on the index labels. 

```
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
               index=['a', 'c', 'e', 'f', 'g'])
s1
```

```
s2
```

```
s1 + s2
```

#### Arithmetic methods with fill values
In arithmetic operations between differently indexed objects, you might want to fill
with a special value, like 0, when an axis label is found in one object but not the other.

```
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list('abcde'))
df2.loc[1, 'b'] = np.nan
df1
df2
```

```
df1 + df2
```

```
df1
```

```
df2
```

```
df1.add(df2, fill_value = 0)
```
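The same idea applies to Series. The following is a minimal sketch that reuses the `s1` and `s2` values from the alignment example; the names `total`, `unmatched`, and `filled_total` are introduced here for illustration.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

# Plain addition: labels present in only one operand produce NaN
total = s1 + s2
unmatched = total[total.isna()].index.tolist()

# add with fill_value=0 treats a label missing from one side as 0
filled_total = s1.add(s2, fill_value=0)

print(unmatched)
print(filled_total)
```

Note that fill_value only substitutes for a value missing on one side; if a label is missing or NaN in both operands, the result at that label is still NaN.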

#### Operations between DataFrame and Series
Broadcasting allows operations between DataFrame and Series on either rows or columns.

How to subtract an array of values from a matrix?

```
arr = np.arange(12.).reshape((3, 4))
arr
arr[0]
arr - arr[0]
```

How to subtract a Series from a DataFrame?

```
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
frame
```

```
frame - series
```

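What happens when the Series index only partly matches the DataFrame’s columns? The objects are reindexed to the union of the labels, and unmatched labels produce missing values.

```
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
frame + series2
```

How to subtract a Series from each column of a DataFrame, matching on the rows?

```
series3 = frame['d']
frame
series3
frame.sub(series3, axis='index')
```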
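A common use of broadcasting along the row index is centering each row on its own mean. The following is a minimal sketch using the same `frame` layout as above; `row_means` and `centered` are names introduced for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# One mean per row, indexed by the row labels
row_means = frame.mean(axis='columns')

# axis='index' matches the Series labels against the row index,
# so each row has its own mean subtracted
centered = frame.sub(row_means, axis='index')

print(centered)
```

Writing `frame - row_means` instead would try to match the row labels against the column labels and would produce only missing values.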

### Function Application and Mapping

NumPy ufuncs (element-wise array functions) also work with pandas objects.

```
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
np.abs(frame)
```

Another frequent operation is applying a function on one-dimensional arrays to each
column or row. DataFrame’s apply method does exactly this:

```
f = lambda x: x.max() - x.min()
frame.apply(f)
```

```
frame.apply(f, axis='columns')
```

```
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)
```

Element-wise Python functions can be used, too. Suppose you wanted to compute a
formatted string from each floating-point value in frame. You can do this with
applymap.

```
format = lambda x: '%.2f' % x
frame.applymap(format)
```


The reason for the name applymap is that Series has a map method for applying an
element-wise function.

```
frame['e'].map(format)
```
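Series.map also accepts a dictionary, which is convenient for recoding categorical values. The following is a minimal sketch on made-up values in the style of the Titanic Sex column; the 0/1 coding is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical values in the style of the Titanic 'Sex' column
sex = pd.Series(['male', 'female', 'female', 'male'])

# map with a dict: each value is looked up as a key;
# values without a matching key would become NaN
codes = sex.map({'male': 0, 'female': 1})

# map with a function: called once per element
initials = sex.map(lambda s: s[0].upper())

print(codes.tolist(), initials.tolist())
```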

### Sorting and Ranking
Sorting a dataset by some criterion is another important built-in operation. To sort
lexicographically by row or column index, use the sort_index method, which returns
a new, sorted object.

```
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()
```

How to sort the columns of the Titanic DataFrame?

```
df.sort_index(axis=1)
```

How to sort the Titanic data by Age and Fare in descending order?

```
df.sort_values(by=['Age', 'Fare'], ascending=False)
```

Ranking assigns ranks from one through the number of valid data points in an array.
The rank methods for Series and DataFrame are the place to look; by default rank
breaks ties by assigning each group the mean rank.

```
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()
```

```
obj.rank(method='first')
```

```
# Assign tie values the maximum rank in the group
obj.rank(ascending=False, method='max')
```

```
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                      'c': [-2, 5, 8, -2.5]})
frame
frame.rank(axis='columns')
```
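To see how the tie-breaking rules differ, it can help to put them side by side. The following is a minimal sketch that ranks the same Series with each of the method options accepted by rank; the `ranks` table is introduced here for illustration.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method, for side-by-side comparison
ranks = pd.DataFrame({m: obj.rank(method=m)
                      for m in ['average', 'min', 'max', 'first', 'dense']})
ranks.insert(0, 'value', obj)

print(ranks)
```

The two 7s tie for the top two places: 'average' gives both 6.5, 'min' gives both 6, 'max' gives both 7, 'first' gives 6 and 7 in order of appearance, and 'dense' gives both 5 because it leaves no gaps between groups.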

### Axis Indexes with Duplicate Labels
pandas Series and DataFrame objects allow duplicate index or column labels.

```
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj
```

```
obj.index.is_unique
```

```
obj['a']
```

```
data = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
data.loc['b']
```
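Duplicate labels change what a selection returns, which can surprise code that expects a scalar. The following is a minimal sketch; the last step, which keeps only the first entry per label, is one possible way to deduplicate and not the only one.

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

dup = obj['a']      # label occurs twice -> a Series with two entries
single = obj['c']   # label occurs once  -> a scalar

print(type(dup).__name__, type(single).__name__)

# One possible way to keep only the first entry for each label
first_only = obj[~obj.index.duplicated(keep='first')]
print(first_only)
```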

## Summarizing and Computing Descriptive Statistics

pandas objects are equipped with a set of common mathematical and statistical methods.
Most of these fall into the category of reductions or summary statistics, methods
that extract a single value (like the sum or mean) from a Series or a Series of values
from the rows or columns of a DataFrame. Compared with the similar methods
found on NumPy arrays, they have built-in handling for missing data.

How many female passengers were there?

```
(df.Sex == 'female').sum()
```

Who was the oldest passenger?

```
df.columns
```

```
# idxmax returns an index label, so select with loc rather than iloc
df.loc[df.Age.idxmax(), ['Name', 'Age', 'Sex']]
```
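The built-in handling of missing data can be seen on a small example. The following is a minimal sketch with made-up passengers (the `people` frame and its values are hypothetical) and an index that deliberately does not start at 0.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers; one age is missing
people = pd.DataFrame({'Name': ['Ann', 'Bob', 'Cy', 'Di'],
                       'Age': [31.0, np.nan, 80.0, 4.0]},
                      index=[10, 11, 12, 13])

mean_age = people['Age'].mean()                 # NaN is skipped by default
mean_strict = people['Age'].mean(skipna=False)  # NaN propagates to the result

# idxmax returns an index label, so it pairs with loc rather than iloc
oldest = people.loc[people['Age'].idxmax(), 'Name']

print(mean_age, mean_strict, oldest)
```

With this index, passing the result of idxmax to iloc would raise an error, because 12 is a label here and not a valid position.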

### Correlation and Covariance
Some summary statistics, like correlation and covariance, are computed from pairs of
arguments. Let’s consider some DataFrames of stock prices and volumes obtained
from Yahoo! Finance.

Download the two pickle files and put them under datasets.

```
price = pd.read_pickle('datasets/yahoo_price.pkl')
volume = pd.read_pickle('datasets/yahoo_volume.pkl')
```

```
price.head()
```

```
volume.head()
```

```
print(price.index.min(), price.index.max())
```

```
returns = price.pct_change()
returns.head()
```

The corr method of Series computes the correlation of the overlapping, non-NA,
aligned-by-index values in two Series. Relatedly, cov computes the covariance.

```
returns['MSFT'].corr(returns['IBM'])
```

```
returns['MSFT'].cov(returns['IBM'])
```

```
returns.corr()
```

```
returns.cov()
```
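To make the "overlapping, non-NA" behavior concrete, the correlation can be recomputed by hand on only the rows where both Series have values. The following is a minimal sketch on synthetic data, not the Yahoo! Finance files; the seed and the 0.5 coefficient are arbitrary.

```python
import numpy as np
import pandas as pd

# Synthetic returns, not the Yahoo! Finance data used above
rng = np.random.default_rng(0)
a = pd.Series(rng.normal(size=50))
b = 0.5 * a + pd.Series(rng.normal(size=50))
b.iloc[:5] = np.nan                     # introduce some missing values

r_pandas = a.corr(b)                    # drops the pairs where either side is NaN

# Same calculation done by hand on the overlapping rows
both = pd.concat([a, b], axis=1).dropna()
r_numpy = np.corrcoef(both[0], both[1])[0, 1]

print(r_pandas, r_numpy)
```

The two numbers should agree up to floating-point error, since corr uses the Pearson coefficient by default.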

Using DataFrame’s corrwith method, you can compute pairwise correlations
between a DataFrame’s columns or rows with another Series or DataFrame. Passing a
Series returns a Series with the correlation value computed for each column:

```
returns.corrwith(returns.IBM)
```

Passing a DataFrame computes the correlations of matching column names.

```
returns.corrwith(volume)
```

### Unique Values, Value Counts, and Membership
Another class of related methods extracts information about the values contained in a
one-dimensional Series.

How many distinct Pclass values are there in the Titanic data?

```
df.Pclass.unique()
```

How were the passengers distributed among the three Pclasses?

```
df.Pclass.value_counts()
```
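Membership tests complete this group of methods. The following is a minimal sketch on made-up class labels in the style of the Pclass column, using isin to build a boolean mask for filtering.

```python
import pandas as pd

# Hypothetical class labels in the style of the Titanic 'Pclass' column
pclass = pd.Series([3, 1, 3, 2, 3, 1])

distinct = pclass.unique()        # distinct values, in order of appearance
counts = pclass.value_counts()    # frequency of each value
mask = pclass.isin([1, 2])        # boolean membership test per element
upper = pclass[mask]              # keep only first- and second-class entries

print(distinct, counts.to_dict(), upper.tolist())
```

The same mask can be applied to a whole DataFrame, for example `df[df.Pclass.isin([1, 2])]`, to select the matching rows.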