# A brief introduction to Pandas
*This tutorial is largely inspired by the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook) and is released under the MIT license.*

*Some of the exercises proposed here, as well as many others, can be found on [Kaggle](https://www.kaggle.com/python10pm/pandas-75-exercises-with-solutions).*

## Pandas as an extension of NumPy better suited for experimental data

- Designed to be compatible with NumPy operations
- Well suited to reading and writing data stored as tables
- Well suited to mixed data types
- Handles missing values
- Supports the usual table-based operations (in particular filtering)


In [21]:
# A NumPy array has a single dtype: mixed inputs are coerced to a common type (here, strings)
import numpy as np
np.array([1., "hello", False])

array(['1.0', 'hello', 'False'], dtype='<U32')
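By contrast, a pandas `DataFrame` keeps a separate data type for each column and can hold missing values. The following is a minimal sketch; the column names and values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each column keeps its own data type
mixed = pd.DataFrame({
    "value": [1.0, 2.5, np.nan],       # floats, with one missing value
    "label": ["hello", "world", "!"],  # strings
    "flag": [False, True, False],      # booleans
})

# One dtype per column, instead of a single dtype for the whole array
print(mixed.dtypes)
```

Printing `mixed.dtypes` shows that the numeric and boolean columns are not converted to strings, unlike in the NumPy array above.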

## Introducing pandas objects

In this section we will review the two main pandas objects:

- pandas `Series`
- pandas `DataFrame`

The conventional way to import the pandas library is under the `pd` alias.

In [23]:
# Basic import
import numpy as np
import pandas as pd

### The Pandas `Series` Object

A Pandas ``Series`` is a one-dimensional array of indexed data.
It can be created from a list or array as follows:

In [21]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

As we see in the output, the ``Series`` contains a sequence of values and a sequence of indices, which we can access with the ``values`` and ``index`` attributes.

The ``values`` are simply a familiar NumPy array:

In [22]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

The ``index`` is an array-like object of type ``pd.Index``.

In [25]:
data.index

RangeIndex(start=0, stop=4, step=1)

As with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [26]:
data[1]

0.5

As we will see, though, the Pandas ``Series`` is much more general and flexible than the one-dimensional NumPy array that it emulates.

### `Series` as a generalized 1-dimensional NumPy array
The essential difference between a 1-dimensional NumPy array and a ``Series`` is the presence of the **index**: while the NumPy array has an *implicitly defined* integer index used to access the values, the Pandas ``Series`` has an *explicitly defined* index associated with the values.

This explicit index definition gives the ``Series`` object additional capabilities. In particular, the index need not be an integer, but can consist of values of any desired type.
For example, if we wish, we can use strings as an index:

In [28]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [29]:
data['b']

0.5

We can even use non-contiguous or non-sequential indices:

In [30]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

In [31]:
data[5]

0.5
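With an integer index like this one, square brackets look values up by label, not by position. The sketch below uses the `.loc` (label-based) and `.iloc` (position-based) accessors to make the distinction explicit; the variable names are illustrative.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

# .loc looks up by index label: the label 2 is attached to the first value
by_label = data.loc[2]

# .iloc looks up by integer position: position 2 is the third value
by_position = data.iloc[2]

print(by_label, by_position)
```

Here the same number `2` selects two different elements depending on the accessor, which is why being explicit helps when the index is made of integers.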

This match between an index and a value is reminiscent of Python dictionaries, which can also be used to construct a pandas `Series` directly:

In [32]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
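The dictionary analogy can be pushed a little further. As a small sketch, the snippet below reuses the same population data to show a dictionary-style membership test, and label-based slicing, which a plain dictionary does not offer.

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Membership is tested against the index, as with dictionary keys
has_texas = 'Texas' in population

# Unlike a dictionary, a Series supports slicing by label;
# both end labels are included in the result
subset = population['Texas':'Florida']

print(has_texas)
print(subset)
```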

#### Exercise
Create a pandas `Series` whose index contains the unique characters of the following string and whose values are the number of occurrences of each of these characters:

In [24]:
my_string = "I love coding in Python, this is a wonderful language !"
my_string_as_arr = np.array(list(my_string))
# TODO
my_string_as_arr

array(['I', ' ', 'l', 'o', 'v', 'e', ' ', 'c', 'o', 'd', 'i', 'n', 'g',
       ' ', 'i', 'n', ' ', 'P', 'y', 't', 'h', 'o', 'n', ',', ' ', 't',
       'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 'w', 'o', 'n', 'd',
       'e', 'r', 'f', 'u', 'l', ' ', 'l', 'a', 'n', 'g', 'u', 'a', 'g',
       'e', ' ', '!'], dtype='<U1')

In [45]:
arr = np.array(list(my_string))
unique_characters = np.unique(arr)

letter_count = pd.Series(np.zeros(len(unique_characters), dtype=int),
                         index=unique_characters)

for letter in arr:
    letter_count[letter] += 1

letter_count

     10
!     1
,     1
I     1
P     1
a     3
c     1
d     2
e     3
f     1
g     3
h     2
i     4
l     3
n     5
o     4
r     1
s     2
t     2
u     2
v     1
w     1
y     1
dtype: int64
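The explicit loop above is one possible solution. As an alternative sketch, the `value_counts` method of a `Series` performs the same tally in a single call; `sort_index` is only used here to order the result by character.

```python
import pandas as pd

my_string = "I love coding in Python, this is a wonderful language !"

# value_counts tallies each distinct character;
# sort_index orders the result by character rather than by count
letter_count = pd.Series(list(my_string)).value_counts().sort_index()

print(letter_count)
```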

#### Pandas Series and NumPy functions

Pandas is designed to work with NumPy, so most NumPy universal functions (ufuncs, i.e. functions that operate element-wise) and statistical routines work on a pandas `Series`:

In [34]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [35]:
# TODO: compute a Series containing the exponential value of data
np.exp(data)

a    1.284025
b    1.648721
c    2.117000
d    2.718282
dtype: float64

In [36]:
# TODO: compute the average of the values in data
np.mean(data)

0.625
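Two details are worth noting in the outputs above: a ufunc applied to a `Series` returns a `Series` with the same index, and many statistical routines are also available as `Series` methods. A brief sketch, with illustrative variable names:

```python
import numpy as np
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

# A ufunc returns a Series that keeps the original index labels
doubled = np.multiply(data, 2)

# Method equivalent of np.mean(data)
average = data.mean()

print(doubled)
print(average)
```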

### The Pandas `DataFrame` Object
A pandas `DataFrame` is a 2-dimensional array of data where both rows and columns are indexed. Let's combine two pandas `Series` into one `DataFrame`.


In [38]:
area_dict = {'California': 423967,
             'Texas': 695662,
             'New York': 141297,
             'Florida': 170312,
             'Illinois': 149995}
area = pd.Series(area_dict)

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)

Both can be combined in a single 2-dimensional object:

In [39]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

| | population | area |
|---|---|---|
| California | 38332521 | 423967 |
| Texas | 26448193 | 695662 |
| New York | 19651127 | 141297 |
| Florida | 19552860 | 170312 |
| Illinois | 12882135 | 149995 |


Like the ``Series`` object, the ``DataFrame`` has an ``index`` attribute that gives access to the index labels:

In [60]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

Additionally, the ``DataFrame`` has a ``columns`` attribute, which is an ``Index`` object holding the column labels:

In [61]:
states.columns

Index(['population', 'area'], dtype='object')

### DataFrame as specialized dictionary

A ``DataFrame`` can be seen as a map between a column name and ``Series`` of column data.
For example, indexing with the ``'area'`` key returns the ``Series`` object containing the areas we saw earlier:

In [62]:
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
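One consequence of this dictionary-like view is that membership tests and square brackets refer to column labels, not row labels. The sketch below uses a three-state subset of the data above and the `.loc` accessor to reach a row by its label.

```python
import pandas as pd

# Three-state subset of the data used above
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127})
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297})
states = pd.DataFrame({'population': population, 'area': area})

# Like dictionary keys, `in` checks the column labels
area_is_key = 'area' in states
texas_is_key = 'Texas' in states   # row labels are not keys

# Rows are reached by label through .loc
texas_row = states.loc['Texas']

print(area_is_key, texas_is_key)
print(texas_row)
```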

### Constructing DataFrame objects

A Pandas ``DataFrame`` can be constructed in a variety of ways.
Here we'll give several examples.

#### From a single Series object

A ``DataFrame`` is a collection of ``Series`` objects, and a single-column ``DataFrame`` can be constructed from a single ``Series``:

In [63]:
pd.DataFrame(population, columns=['population'])

| | population |
|---|---|
| California | 38332521 |
| Texas | 26448193 |
| New York | 19651127 |
| Florida | 19552860 |
| Illinois | 12882135 |


#### From a list of dicts

Any list of dictionaries can be made into a ``DataFrame``.

In [12]:
data = []
for i in range(3):
    data.append({'a': i, 'b': 2 * i})

pd.DataFrame(data)

| | a | b |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 2 |
| 2 | 2 | 4 |


If some keys are missing from some of the dictionaries, Pandas fills the corresponding entries with ``NaN`` (i.e., "not a number") values:

In [65]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

| | a | b | c |
|---|---|---|---|
| 0 | 1.0 | 2 | NaN |
| 1 | NaN | 3 | 4.0 |
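Once missing values are present, pandas provides methods to locate, replace, or drop them. The sketch below applies `isna`, `fillna` and `dropna` to the same two-row example; the fill value `0` is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Boolean DataFrame marking the missing entries
missing_mask = df.isna()

# Replace missing entries with an arbitrary fill value
filled = df.fillna(0)

# Keep only the columns that contain no missing value
complete_columns = df.dropna(axis=1)

print(missing_mask)
print(filled)
print(complete_columns)
```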


#### From a dictionary of Series objects

As we saw before, a ``DataFrame`` can be constructed from a dictionary of ``Series`` objects as well:

In [26]:
pd.DataFrame({'population': population,
              'area': area})

| | population | area |
|---|---|---|
| California | 38332521 | 423967 |
| Texas | 26448193 | 695662 |
| New York | 19651127 | 141297 |
| Florida | 19552860 | 170312 |
| Illinois | 12882135 | 149995 |


#### From a two-dimensional NumPy array

Given a two-dimensional array of data, we can create a ``DataFrame`` with any specified column and index names.
If omitted, an integer index will be used for each:

In [27]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

| | foo | bar |
|---|---|---|
| a | 0.865257 | 0.213169 |
| b | 0.442759 | 0.108267 |
| c | 0.04711 | 0.905718 |
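To illustrate the default behaviour, the sketch below builds a `DataFrame` from a random array without passing `columns` or `index`. The seed and the use of `np.random.default_rng` are choices made for this example only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# No column or index names given: both default to integers starting at 0
df = pd.DataFrame(rng.random((3, 2)))

print(df.index)
print(df.columns)
```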


### Adding a new column to a DataFrame
A `Series` or 1-dimensional NumPy array can be added as a new column in a `DataFrame` simply by using a dictionary-like syntax:

In [13]:
df = pd.DataFrame({'age': [25,26,27]}, index=['alice','bob','charlie'])
df

| | age |
|---|---|
| alice | 25 |
| bob | 26 |
| charlie | 27 |


In [16]:
df['age_times_two'] = df['age'] * 2
df["zeros"] = np.array([0, 0, 0])

df

| | age | age_times_two | zeros |
|---|---|---|---|
| alice | 25 | 50 | 0 |
| bob | 26 | 52 | 0 |
| charlie | 27 | 54 | 0 |


Several columns from a `DataFrame` can also be combined to form a new one:

In [17]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

| | area | pop |
|---|---|---|
| California | 423967 | 38332521 |
| Texas | 695662 | 26448193 |
| New York | 141297 | 19651127 |
| Florida | 170312 | 19552860 |
| Illinois | 149995 | 12882135 |


In [20]:
data["density"] = data["pop"] / data["area"]
data

| | area | pop | density |
|---|---|---|---|
| California | 423967 | 38332521 | 90.413926 |
| Texas | 695662 | 26448193 | 38.01874 |
| New York | 141297 | 19651127 | 139.076746 |
| Florida | 170312 | 19552860 | 114.806121 |
| Illinois | 149995 | 12882135 | 85.883763 |


For more complex operations, for instance those involving multiple columns or custom functions, see [the apply method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html).
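As a minimal sketch of `apply`, the snippet below recomputes the density with a custom function called once per row (`axis=1`). The helper name `density` is hypothetical, and only two states are used to keep the example short; for a simple ratio like this one, the vectorized column division remains the more direct option.

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [38332521, 26448193]},
                    index=['California', 'Texas'])

def density(row):
    """Hypothetical helper: population divided by area for one row."""
    return row['pop'] / row['area']

# axis=1 passes each row to the function as a Series
data['density'] = data.apply(density, axis=1)

print(data)
```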

#### Exercise
Create a pandas DataFrame with indexes from **1 to 26** with a column "lowercase" containing each letter of the alphabet in lower case and a column "uppercase" containing each letter of the alphabet in upper case. 

In [40]:
alphabet = list('abcdefghijklmnopqrstuvwxyz')

alphabet[:5]

['a', 'b', 'c', 'd', 'e']

In [41]:
alphabet_df = pd.DataFrame(alphabet, index=range(1, 27), columns=["lowercase"])

# alphabet in capital letters
alphabet_capitalized = []
for letter in alphabet:
    alphabet_capitalized.append(letter.upper())

alphabet_df["uppercase"] = alphabet_capitalized

alphabet_df.head(5)

| | lowercase | uppercase |
|---|---|---|
| 1 | a | A |
| 2 | b | B |
| 3 | c | C |
| 4 | d | D |
| 5 | e | E |
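Typing the alphabet by hand is error-prone. As an alternative sketch, the standard-library `string` module provides the lowercase and uppercase alphabets as constants, so the same `DataFrame` can be built in one step.

```python
import string

import pandas as pd

# string.ascii_lowercase and string.ascii_uppercase hold the 26 letters in order
alphabet_df = pd.DataFrame({'lowercase': list(string.ascii_lowercase),
                            'uppercase': list(string.ascii_uppercase)},
                           index=range(1, 27))

print(alphabet_df.head(5))
```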


#### Exercise

Create a chessboard `DataFrame`, i.e. one with column names ranging from "A" to "H" and rows from 1 to 8, using booleans to represent the squares' colors.

In [9]:
col_names = ["A", "B", "C", "D", "E", "F", "G", "H"]

squares = np.zeros(64, dtype=bool)
squares = squares.reshape((8, 8))
for idx, row in enumerate(squares):
    row[idx%2::2] = True

pd.DataFrame(squares,
             columns=col_names,
             index=range(1, 9))

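| | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| 1 | True | False | True | False | True | False | True | False |
| 2 | False | True | False | True | False | True | False | True |
| 3 | True | False | True | False | True | False | True | False |
| 4 | False | True | False | True | False | True | False | True |
| 5 | True | False | True | False | True | False | True | False |
| 6 | False | True | False | True | False | True | False | True |
| 7 | True | False | True | False | True | False | True | False |
| 8 | False | True | False | True | False | True | False | True |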
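The same board can also be built without a loop. The sketch below is one possible vectorized variant: a square is `True` when its row and column positions have the same parity, which reproduces the pattern above.

```python
import numpy as np
import pandas as pd

# rows[i, j] == i and cols[i, j] == j for every square of the 8x8 grid
rows, cols = np.indices((8, 8))

# True where row and column positions have the same parity
squares = (rows + cols) % 2 == 0

board = pd.DataFrame(squares,
                     columns=list("ABCDEFGH"),
                     index=range(1, 9))

print(board)
```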
