# 1. Introduction to Pandas

Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. 

- It is among the most important tools for data analysts and data scientists.
- It is the most popular library for working with tabular data in Python.

To import it, simply do:

In [1]:
import pandas as pd

To check the version of Pandas you are running, do:

In [2]:
pd.__version__

'0.23.4'

# 2. From Series to DataFrame

To get started with pandas, you will need to get comfortable with its two workhorses: **Series** and **DataFrame**.

They provide a solid, easy-to-use basis for most applications.


Most pandas operations return either a **Series** or a **DataFrame**.

**DataFrames** and **Series** are not simply storage containers. Pandas treats them similarly, and both have built-in support for a variety of data-wrangling operations, such as:

* Single-level and hierarchical indexing
* Handling missing data
* Arithmetic and Boolean operations on entire columns and tables
* Database-type operations (such as merging and aggregation)
* Plotting individual columns and whole tables
* Reading data from files and writing data to files
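A minimal sketch of a few of the operations listed above. The tables `cases` and `regions` and their values are made up for illustration; they are not part of this tutorial's datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical data: case counts per city, with one missing value
cases = pd.DataFrame({
    "City": ["Goma", "Dakar", "Niamey"],
    "Cases": [12.0, np.nan, 7.0],
})
regions = pd.DataFrame({
    "City": ["Goma", "Dakar", "Niamey"],
    "Country": ["DRC", "Senegal", "Niger"],
})

# Handling missing data: replace the missing count with 0
cases["Cases"] = cases["Cases"].fillna(0)

# Arithmetic on an entire column: double every count
cases["Doubled"] = cases["Cases"] * 2

# Database-type operation: join the two tables on the shared "City" column
merged = cases.merge(regions, on="City")

# Aggregation: total number of cases
total = merged["Cases"].sum()
print(merged)
print(total)
```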

## 2.1. **Series**

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index.


You can create a simple Series from any sequence (**a list, a tuple, or a NumPy array**) or even from **a Python dictionary**.

#### From a List

**==> Of numbers**

In [3]:
s1 = pd.Series([2, 10, 12])

In [4]:
s1

0     2
1    10
2    12
dtype: int64

**==> Of strings**

In [5]:
s2 = pd.Series(['Uganda', 'Mali', 'Chad', 'Niger'])

In [6]:
s2

0    Uganda
1      Mali
2      Chad
3     Niger
dtype: object

**==> Of Objects**

In [7]:
s3 = pd.Series(["£12", 25, "Banjul", "50km"])

In [8]:
s3

0       £12
1        25
2    Banjul
3      50km
dtype: object
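When the elements have mixed types, the Series falls back to the generic `object` dtype, but each element keeps its own Python type. A small sketch to check this (the variable name `element_types` is just for illustration):

```python
import pandas as pd

s3 = pd.Series(["£12", 25, "Banjul", "50km"])

# The Series as a whole has the generic object dtype
print(s3.dtype)

# Each element is still an ordinary Python object of its original type
element_types = [type(x) for x in s3]
print(element_types)
```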

#### From a tuple


In [9]:
s4 = pd.Series(("malaria", "tuberculosis", "influenza"))

In [10]:
s4

0         malaria
1    tuberculosis
2       influenza
dtype: object

#### From a numpy array


In [11]:
# Let's import the NumPy library first
import numpy as np

In [15]:
a1 = np.arange(0, 10, 2)
a1

array([0, 2, 4, 6, 8])

In [13]:
s5 = pd.Series(a1)

In [14]:
s5

0    0
1    2
2    4
3    6
4    8
dtype: int64

#### From a python dictionary


In [27]:
d1 = {"Outbreak": 'Ebola',
      "City": 'Goma',
      "Country": 'DRC',
      "Continent": 'Africa'}

In [28]:
d1

{'Outbreak': 'Ebola', 'City': 'Goma', 'Country': 'DRC', 'Continent': 'Africa'}

In [29]:
s6 = pd.Series(d1)

In [30]:
s6

Outbreak      Ebola
City           Goma
Country         DRC
Continent    Africa
dtype: object

As you may have noticed, a column always appears on the left when a Series is printed.

This is the index. By default, it runs from 0 to n-1, where n is the length of the Series.

When a Series is built from a dictionary, the **keys** of the dictionary become the index.

The **values** of the dictionary become the actual content of the Series.

You can verify this with the commands below:


In [31]:
s6.index

Index(['Outbreak', 'City', 'Country', 'Continent'], dtype='object')

In [32]:
s6.values

array(['Ebola', 'Goma', 'DRC', 'Africa'], dtype=object)
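For comparison, here is a small sketch contrasting the default index of a Series built from a list with the index of a Series built from a dictionary. The names `s_default` and `s_from_dict` are illustrative.

```python
import pandas as pd

# Built from a list: the index defaults to 0, 1, ..., n-1
s_default = pd.Series(["Uganda", "Mali", "Chad"])
print(s_default.index)          # RangeIndex(start=0, stop=3, step=1)
print(list(s_default.index))    # [0, 1, 2]

# Built from a dictionary: the keys become the index
s_from_dict = pd.Series({"City": "Goma", "Country": "DRC"})
print(list(s_from_dict.index))  # ['City', 'Country']
print(list(s_from_dict.values)) # ['Goma', 'DRC']
```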

A Series can also be created with an explicit index:

In [37]:
# Information recorded from a patient during a survey
s7 = pd.Series(["Traore", "Senegalese", "Single", "Wolof"],
               index=["Name", "Nationality", "Status", "Language"])

In [36]:
s7

Name               Traore
Nationality    Senegalese
Status             Single
Language            Wolof
dtype: object

In [41]:
# The ages of the members of a family in Bouake, Ivory Coast,
# as recorded during a survey
s8 = pd.Series([12, 25, 7, 58, 39],
               index=["Yao", "Kouassi", "Senan", "Bony", "Marguerite"])

In [42]:
s8

Yao           12
Kouassi       25
Senan          7
Bony          58
Marguerite    39
dtype: int64

We can access each element by its label:

In [43]:
s8['Yao']

12

In [44]:
s8['Bony']

58
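Besides square brackets, pandas offers the explicit accessors `.loc` (by label) and `.iloc` (by integer position). A short sketch, rebuilding `s8` so that it runs on its own:

```python
import pandas as pd

s8 = pd.Series([12, 25, 7, 58, 39],
               index=["Yao", "Kouassi", "Senan", "Bony", "Marguerite"])

# Access by label
by_label = s8.loc["Yao"]

# Access by integer position (0 is the first element)
by_position = s8.iloc[0]

# A list of labels returns a smaller Series
several = s8.loc[["Yao", "Bony"]]

print(by_label, by_position)
print(several)
```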

We can check which people are **below the age of 20**:

In [45]:
s8 < 20

Yao            True
Kouassi       False
Senan          True
Bony          False
Marguerite    False
dtype: bool

It returns Boolean values: True where the condition holds and False otherwise.

But what if we want to get the actual values?

In [46]:
s8[s8 < 20]

Yao      12
Senan     7
dtype: int64
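To get a count rather than the values, one option is to sum the Boolean mask, since True counts as 1 and False as 0. The sketch below also combines two conditions with `&`; the names `mask`, `n_below_20` and `teens` are illustrative.

```python
import pandas as pd

s8 = pd.Series([12, 25, 7, 58, 39],
               index=["Yao", "Kouassi", "Senan", "Bony", "Marguerite"])

# Boolean mask: True for people below the age of 20
mask = s8 < 20

# Summing the mask counts the True values
n_below_20 = int(mask.sum())

# The labels (names) of the matching entries
names_below_20 = list(s8[mask].index)

# Conditions are combined with & (and) or | (or); each needs its own parentheses
teens = s8[(s8 >= 10) & (s8 < 20)]

print(n_below_20)
print(names_below_20)
print(teens)
```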

### Basic Statistics

**==> The sum of all the elements**

In [47]:
s8.sum()

141

**==> The average**

In [48]:
s8.mean()

28.2

**==> The lowest value**

In [49]:
s8.min()

7

**==> The largest value**

In [50]:
s8.max()

58

**==> The variance**

In [51]:
s8.var()

431.70000000000005

**==> The standard deviation**

In [52]:
s8.std()

20.777391559096152
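By default, `var()` and `std()` use the sample formula, which divides by n - 1 rather than n. The sketch below recomputes the variance by hand to check this, and shows `describe()`, which gathers several of these statistics in one call. The variable names are illustrative.

```python
import pandas as pd

s8 = pd.Series([12, 25, 7, 58, 39],
               index=["Yao", "Kouassi", "Senan", "Bony", "Marguerite"])

# Default: sample variance (ddof=1, i.e. divide by n - 1)
sample_var = s8.var()

# The same value computed by hand
manual_var = ((s8 - s8.mean()) ** 2).sum() / (len(s8) - 1)

# Population variance: divide by n instead
population_var = s8.var(ddof=0)

# count, mean, std, min, quartiles and max in a single Series
summary = s8.describe()

print(sample_var, manual_var, population_var)
print(summary)
```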

## 2.2. **DataFrame**

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).

It can also simply be viewed as a collection of Series that share the same index.

There are numerous ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:

In [56]:
# The temperature across African cities this summer
d2 = {"City": ['Tomboucto', 'Thies', 'Nouackshott', 'Niamey', 'Douala'],
      "Temperature(°C)": [29, 32, 27, 19, 35]}

In [57]:
df1 = pd.DataFrame(d2)

In [63]:
df1

|  | City | Temperature(°C) |
|---|---|---|
| 0 | Tomboucto | 29 |
| 1 | Thies | 32 |
| 2 | Nouackshott | 27 |
| 3 | Niamey | 19 |
| 4 | Douala | 35 |
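Once a DataFrame is built, a few attributes give a quick overview of its structure. A minimal sketch, rebuilding `df1` so that it runs on its own:

```python
import pandas as pd

d2 = {"City": ['Tomboucto', 'Thies', 'Nouackshott', 'Niamey', 'Douala'],
      "Temperature(°C)": [29, 32, 27, 19, 35]}
df1 = pd.DataFrame(d2)

# (number of rows, number of columns)
print(df1.shape)

# Column labels, taken from the dictionary keys
print(list(df1.columns))

# Row labels: the default 0 to n-1 index
print(list(df1.index))

# The data type of each column
print(df1.dtypes)
```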


In [80]:
# The birth rate from several hospitals in Cotonou, Benin in a specific week
d2 = {"Hospital": ["Hopital General", "Institut Pasteur", "Clinique Notre-Dame",
                   "Hopital Regional", "Hopital Jamot", "Liberty Clinic"],
      "Birth_Rate": [0.4, 0.25, 0.98, 0.18, 0.62, 0.16]}

In [81]:
df2 = pd.DataFrame(d2)

In [82]:
df2

|  | Hospital | Birth_Rate |
|---|---|---|
| 0 | Hopital General | 0.4 |
| 1 | Institut Pasteur | 0.25 |
| 2 | Clinique Notre-Dame | 0.98 |
| 3 | Hopital Regional | 0.18 |
| 4 | Hopital Jamot | 0.62 |
| 5 | Liberty Clinic | 0.16 |


To query the content of a DataFrame, we can select its columns by name. Each column is returned as a Series:

In [83]:
df2['Hospital']

0        Hopital General
1       Institut Pasteur
2    Clinique Notre-Dame
3       Hopital Regional
4          Hopital Jamot
5         Liberty Clinic
Name: Hospital, dtype: object

In [84]:
df2['Birth_Rate']

0    0.40
1    0.25
2    0.98
3    0.18
4    0.62
5    0.16
Name: Birth_Rate, dtype: float64
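The sketch below checks that a single column comes back as a Series, while a list of column names returns a smaller DataFrame. It rebuilds `df2` so that it runs on its own; `one_column` and `two_columns` are illustrative names.

```python
import pandas as pd

d2 = {"Hospital": ["Hopital General", "Institut Pasteur", "Clinique Notre-Dame",
                   "Hopital Regional", "Hopital Jamot", "Liberty Clinic"],
      "Birth_Rate": [0.4, 0.25, 0.98, 0.18, 0.62, 0.16]}
df2 = pd.DataFrame(d2)

# A single column name returns a Series
one_column = df2["Birth_Rate"]
print(type(one_column))

# A list of column names returns a DataFrame
two_columns = df2[["Hospital", "Birth_Rate"]]
print(type(two_columns))

# Series methods work directly on a column
print(one_column.max())
```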

We might decide to add another column: for instance, whether or not there was light (electricity) in those hospitals that day.

In [85]:
df2['Light'] = ["Yes", "Yes", "No", "Yes", "No", "No"]

In [86]:
df2

|  | Hospital | Birth_Rate | Light |
|---|---|---|---|
| 0 | Hopital General | 0.4 | Yes |
| 1 | Institut Pasteur | 0.25 | Yes |
| 2 | Clinique Notre-Dame | 0.98 | No |
| 3 | Hopital Regional | 0.18 | Yes |
| 4 | Hopital Jamot | 0.62 | No |
| 5 | Liberty Clinic | 0.16 | No |
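A new column can also be derived from an existing one, and the assigned list must have one value per row. A hedged sketch of both points; the column names `Has_Light` and `Too_Short` are made up for illustration.

```python
import pandas as pd

df2 = pd.DataFrame({
    "Hospital": ["Hopital General", "Institut Pasteur", "Clinique Notre-Dame",
                 "Hopital Regional", "Hopital Jamot", "Liberty Clinic"],
    "Birth_Rate": [0.4, 0.25, 0.98, 0.18, 0.62, 0.16],
})
df2["Light"] = ["Yes", "Yes", "No", "Yes", "No", "No"]

# Derive a Boolean column from the existing "Light" column
df2["Has_Light"] = df2["Light"] == "Yes"

# Assigning a list with the wrong number of values raises a ValueError
try:
    df2["Too_Short"] = ["Yes", "No"]
    length_error = None
except ValueError as exc:
    length_error = str(exc)

print(df2)
print(length_error)
```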


We can get the subset of a DataFrame that satisfies certain conditions.

#### Q: What are the hospitals where the birth rate is less than 0.50?

In [87]:
df2[df2['Birth_Rate'] < 0.5]

|  | Hospital | Birth_Rate | Light |
|---|---|---|---|
| 0 | Hopital General | 0.4 | Yes |
| 1 | Institut Pasteur | 0.25 | Yes |
| 3 | Hopital Regional | 0.18 | Yes |
| 5 | Liberty Clinic | 0.16 | No |


#### Q: What are the hospitals which had light on that day?

In [88]:
df2[df2['Light'] == "Yes"]

|  | Hospital | Birth_Rate | Light |
|---|---|---|---|
| 0 | Hopital General | 0.4 | Yes |
| 1 | Institut Pasteur | 0.25 | Yes |
| 3 | Hopital Regional | 0.18 | Yes |
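The two questions can be combined into one filter. As a sketch: conditions are joined with `&`, each wrapped in parentheses, and `.loc` can select rows and a column at the same time. The names `both` and `low_rate_names` are illustrative.

```python
import pandas as pd

df2 = pd.DataFrame({
    "Hospital": ["Hopital General", "Institut Pasteur", "Clinique Notre-Dame",
                 "Hopital Regional", "Hopital Jamot", "Liberty Clinic"],
    "Birth_Rate": [0.4, 0.25, 0.98, 0.18, 0.62, 0.16],
    "Light": ["Yes", "Yes", "No", "Yes", "No", "No"],
})

# Hospitals with a birth rate below 0.5 AND light on that day
both = df2[(df2["Birth_Rate"] < 0.5) & (df2["Light"] == "Yes")]

# Only the names of the hospitals with a birth rate below 0.5
low_rate_names = df2.loc[df2["Birth_Rate"] < 0.5, "Hospital"]

print(both)
print(low_rate_names.tolist())
```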


Another common way to build a DataFrame is from a nested dict of dicts. The outer keys become the column names and the inner keys become the row index.

In [90]:
data = {
    'Size(cm)': {'Paul': 125, 'John': 175, 'Thomas': 186, 'Julio': 145},
    'Weight(kg)': {'Paul': 68, 'John': 72, 'Thomas': 102, 'Julio': 98},
    'Age': {'Paul': 24, 'John': 31, 'Thomas': 18, 'Julio': 36},
}

In [91]:
df3 = pd.DataFrame(data)

In [92]:
df3

|  | Size(cm) | Weight(kg) | Age |
|---|---|---|---|
| John | 175 | 72 | 31 |
| Julio | 145 | 98 | 36 |
| Paul | 125 | 68 | 24 |
| Thomas | 186 | 102 | 18 |
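The output above, produced with pandas 0.23.4, lists the rows alphabetically. Row order may differ in other pandas versions, so the sketch below sorts the index explicitly before looking up individual cells and transposing the table. The names `paul_age` and `by_person` are illustrative.

```python
import pandas as pd

data = {
    'Size(cm)': {'Paul': 125, 'John': 175, 'Thomas': 186, 'Julio': 145},
    'Weight(kg)': {'Paul': 68, 'John': 72, 'Thomas': 102, 'Julio': 98},
    'Age': {'Paul': 24, 'John': 31, 'Thomas': 18, 'Julio': 36},
}

# Outer keys become columns, inner keys become the row index;
# sort_index() makes the row order explicit
df3 = pd.DataFrame(data).sort_index()

# Look up a single cell by row label and column label
paul_age = df3.loc['Paul', 'Age']

# Transpose: people become columns, measurements become rows
by_person = df3.T

print(df3)
print(paul_age)
print(by_person)
```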
