# Session 2. Pandas Data Structures

## The previous section introduced the Pandas DataFrame and Series objects. These data structures resemble the primitive Python data containers (lists and dictionaries) for indexing and labeling, but have additional features that make working with data easier.


## This session will cover:
1. Loading in manual data
2. The Series object
3. Basic operations on Series objects
4. The DataFrame object
5. Conditional subsetting and fancy slicing and indexing
6. Saving out data



# 0. Let's load some libraries

In [1]:
import pandas as pd

In [2]:
import numpy as np

# 1. Let's create our own data

## 1.1. Creating a Series

### The Pandas Series is a one-dimensional container, similar to the built-in Python list.
### Unlike a list, a Series carries an index of labels, and all of its elements share a single dtype (mixed or string values are stored under the generic object dtype).


In [3]:
mySeries=pd.Series(['banana, apple, coconut,orange'])

In [4]:
mySeries

0    banana, apple, coconut,orange
dtype: object
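The list passed above contains one string, so the Series has a single element at index 0. A minimal sketch of the contrast, assuming the intent was one fruit per element; the names `one_string`, `four_fruits` and `split_fruits` are illustrative and not part of the session:

```python
import pandas as pd

# One string that happens to contain commas: a single element
one_string = pd.Series(['banana, apple, coconut,orange'])

# Four separate strings: one element per fruit
four_fruits = pd.Series(['banana', 'apple', 'coconut', 'orange'])

print(len(one_string))   # 1
print(len(four_fruits))  # 4

# An existing comma-separated string can be split into separate elements
split_fruits = pd.Series([f.strip() for f in one_string.iloc[0].split(',')])
print(split_fruits.tolist())  # ['banana', 'apple', 'coconut', 'orange']
```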

In [5]:
otherSeries=pd.Series(['4.5','3.5','2.3','0.5','6.7'])

In [6]:
otherSeries

0    4.5
1    3.5
2    2.3
3    0.5
4    6.7
dtype: object
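Because the values in `otherSeries` are quoted, they are stored as strings rather than numbers, which is why the dtype is not a float type. A short sketch of converting them, assuming numeric values were wanted; `numericSeries` is a name chosen here for illustration:

```python
import pandas as pd

otherSeries = pd.Series(['4.5', '3.5', '2.3', '0.5', '6.7'])

# Parse each string into a floating-point number
numericSeries = otherSeries.astype(float)

print(numericSeries.dtype)  # float64
print(numericSeries.max())  # 6.7
```

Writing the values without quotes in the first place, as in `pd.Series([4.5, 3.5])`, would give a float Series directly.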

## 1.2. Creating a DataFrame from a dictionary

### The DataFrame is the most common Pandas object. It can be thought of as Python’s way of storing spreadsheet-like data.

### A DataFrame can be thought of as a dictionary of Series objects. This is why dictionaries are the most common way of creating a DataFrame by hand.
### Each key becomes a column name, and its value holds the contents of that column.



In [7]:
dictionaryOfScientists={'Name': ['Rosaline Franklin', 'William Gosset','Alexander Flemming','Carl F. Gauss'],
                        'Occupation': ['Chemist', 'Statistician','Physician','Mathematician'],
                        'Born': ['1920-07-25', '1876-06-13','1881-08-06','1777-04-10'],
                        'Died': ['1958-04-16', '1937-10-16','1954-03-11','1855-02-23'],
                        'Age': [37, 61,73,77]}



In [8]:
dictionaryOfScientists

{'Name': ['Rosaline Franklin',
  'William Gosset',
  'Alexander Flemming',
  'Carl F. Gauss'],
 'Occupation': ['Chemist', 'Statistician', 'Physician', 'Mathematician'],
 'Born': ['1920-07-25', '1876-06-13', '1881-08-06', '1777-04-10'],
 'Died': ['1958-04-16', '1937-10-16', '1954-03-11', '1855-02-23'],
 'Age': [37, 61, 73, 77]}

In [9]:
DataFrameOfScientists = pd.DataFrame(dictionaryOfScientists)

In [10]:
DataFrameOfScientists

Unnamed: 0,Name,Occupation,Born,Died,Age
0,Rosaline Franklin,Chemist,1920-07-25,1958-04-16,37
1,William Gosset,Statistician,1876-06-13,1937-10-16,61
2,Alexander Flemming,Physician,1881-08-06,1954-03-11,73
3,Carl F. Gauss,Mathematician,1777-04-10,1855-02-23,77
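To illustrate the idea that a DataFrame behaves like a dictionary of Series, the sketch below rebuilds a smaller table (two columns and two rows only, to keep it short) and inspects it. The name `smallFrame` is hypothetical.

```python
import pandas as pd

smallFrame = pd.DataFrame({'Name': ['Rosaline Franklin', 'William Gosset'],
                           'Age': [37, 61]})

# The dictionary keys became the column names
print(list(smallFrame.columns))  # ['Name', 'Age']

# Looking up a key returns that column as a Series
print(type(smallFrame['Age']).__name__)  # Series

# Each column keeps its own dtype
print(smallFrame.dtypes)
```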


## 1.3. Creating a DataFrame from records

### We can also create a DataFrame from a list of existing records, where each tuple holds the values of one row.

In [11]:
stocksRecords = [(3000, 'Google'), (213, 'Microsoft'), (1250, 'Apple'), (4500, 'Facebook'),(100, 'Tesla')]

In [12]:
stocksRecords

[(3000, 'Google'),
 (213, 'Microsoft'),
 (1250, 'Apple'),
 (4500, 'Facebook'),
 (100, 'Tesla')]

In [13]:
stocksDataFrame=pd.DataFrame.from_records(stocksRecords, columns=['Price', 'Company'])
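The cell above builds `stocksDataFrame` without displaying it. As a sketch of what `from_records` does with the tuples: each tuple becomes one row, and `columns` names the tuple positions in order. Passing the same list to the plain `pd.DataFrame` constructor is shown as an alternative route; `sameFrame` is an illustrative name.

```python
import pandas as pd

stocksRecords = [(3000, 'Google'), (213, 'Microsoft'), (1250, 'Apple'),
                 (4500, 'Facebook'), (100, 'Tesla')]

stocksDataFrame = pd.DataFrame.from_records(stocksRecords,
                                            columns=['Price', 'Company'])
print(stocksDataFrame.shape)              # (5, 2)
print(stocksDataFrame['Price'].tolist())  # [3000, 213, 1250, 4500, 100]

# The regular constructor accepts a list of tuples as well
sameFrame = pd.DataFrame(stocksRecords, columns=['Price', 'Company'])
print(sameFrame['Company'].tolist() == stocksDataFrame['Company'].tolist())  # True
```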

# 2. Let's select some data

## 2.1. Selection based on column name

In [15]:
DataFrameOfScientists['Name']

0     Rosaline Franklin
1        William Gosset
2    Alexander Flemming
3         Carl F. Gauss
Name: Name, dtype: object

In [16]:
DataFrameOfScientists[['Name','Occupation']]

Unnamed: 0,Name,Occupation
0,Rosaline Franklin,Chemist
1,William Gosset,Statistician
2,Alexander Flemming,Physician
3,Carl F. Gauss,Mathematician
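The two cells above differ in what they return: a single column name gives a Series, while a list of names gives a DataFrame, even when the list has one entry. A small sketch on a cut-down copy of the table; `df`, `oneColumn` and `oneColumnFrame` are names made up for this example:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Rosaline Franklin', 'William Gosset'],
                   'Occupation': ['Chemist', 'Statistician'],
                   'Age': [37, 61]})

oneColumn = df['Name']         # a single key returns a Series
oneColumnFrame = df[['Name']]  # a list of keys returns a DataFrame

print(type(oneColumn).__name__)       # Series
print(type(oneColumnFrame).__name__)  # DataFrame
print(oneColumnFrame.shape)           # (2, 1)
```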


## 2.2. Selection based on filter

In [17]:
# filter to select chemists
chemistFilter=DataFrameOfScientists['Occupation']=='Chemist'

In [18]:
DataFrameOfScientists[chemistFilter]

Unnamed: 0,Name,Occupation,Born,Died,Age
0,Rosaline Franklin,Chemist,1920-07-25,1958-04-16,37


In [19]:
# filter to select those named Alexander Flemming
alexanderFlemmingFilter=DataFrameOfScientists['Name']=='Alexander Flemming'

In [20]:
DataFrameOfScientists[alexanderFlemmingFilter]

Unnamed: 0,Name,Occupation,Born,Died,Age
2,Alexander Flemming,Physician,1881-08-06,1954-03-11,73


In [21]:
# filter to select those named William 
# we use the function str.contains()
williamFilter=DataFrameOfScientists['Name'].str.contains('William')

In [22]:
DataFrameOfScientists[williamFilter]

Unnamed: 0,Name,Occupation,Born,Died,Age
1,William Gosset,Statistician,1876-06-13,1937-10-16,61


In [23]:
# filter to select by age
ageFilter=DataFrameOfScientists['Age']>=60

In [24]:
DataFrameOfScientists[ageFilter]

Unnamed: 0,Name,Occupation,Born,Died,Age
1,William Gosset,Statistician,1876-06-13,1937-10-16,61
2,Alexander Flemming,Physician,1881-08-06,1954-03-11,73
3,Carl F. Gauss,Mathematician,1777-04-10,1855-02-23,77


In [25]:
# filter to select by age range [40, 70): 40 is included, 70 is excluded
ageFilter2=(DataFrameOfScientists['Age']>=40)&(DataFrameOfScientists['Age']<70)

In [26]:
DataFrameOfScientists[ageFilter2]

Unnamed: 0,Name,Occupation,Born,Died,Age
1,William Gosset,Statistician,1876-06-13,1937-10-16,61
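Each filter above is itself a Series of True/False values, one per row, and indexing with it keeps the rows marked True. The sketch below, on a two-column copy of the data, shows the mask and two further ways of building one; the names `df`, `ageMask` and `youngOrOld` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Rosaline Franklin', 'William Gosset',
                            'Alexander Flemming', 'Carl F. Gauss'],
                   'Age': [37, 61, 73, 77]})

# A filter is a boolean Series aligned with the rows
ageMask = df['Age'] >= 60
print(ageMask.tolist())  # [False, True, True, True]

# & means "and", | means "or"; each comparison needs its own parentheses
youngOrOld = (df['Age'] < 40) | (df['Age'] > 75)
print(df.loc[youngOrOld, 'Name'].tolist())  # ['Rosaline Franklin', 'Carl F. Gauss']

# isin() tests membership in a list of values
inList = df['Age'].isin([61, 73])
print(df.loc[inList, 'Name'].tolist())  # ['William Gosset', 'Alexander Flemming']
```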


## 2.3. Selection based on inverse filter

In [27]:
DataFrameOfScientists[~williamFilter]

Unnamed: 0,Name,Occupation,Born,Died,Age
0,Rosaline Franklin,Chemist,1920-07-25,1958-04-16,37
2,Alexander Flemming,Physician,1881-08-06,1954-03-11,73
3,Carl F. Gauss,Mathematician,1777-04-10,1855-02-23,77


## 2.4. Slicing

In [28]:
DataFrameOfScientists.iloc[1:3]

Unnamed: 0,Name,Occupation,Born,Died,Age
1,William Gosset,Statistician,1876-06-13,1937-10-16,61
2,Alexander Flemming,Physician,1881-08-06,1954-03-11,73


In [29]:
DataFrameOfScientists.iloc[1:4]

Unnamed: 0,Name,Occupation,Born,Died,Age
1,William Gosset,Statistician,1876-06-13,1937-10-16,61
2,Alexander Flemming,Physician,1881-08-06,1954-03-11,73
3,Carl F. Gauss,Mathematician,1777-04-10,1855-02-23,77


In [30]:
DataFrameOfScientists.iloc[1:4,0:2]

Unnamed: 0,Name,Occupation
1,William Gosset,Statistician
2,Alexander Flemming,Physician
3,Carl F. Gauss,Mathematician


In [31]:
DataFrameOfScientists.iloc[1:4,0:3]

Unnamed: 0,Name,Occupation,Born
1,William Gosset,Statistician,1876-06-13
2,Alexander Flemming,Physician,1881-08-06
3,Carl F. Gauss,Mathematician,1777-04-10


In [32]:
DataFrameOfScientists.iloc[:,0:2]

Unnamed: 0,Name,Occupation
0,Rosaline Franklin,Chemist
1,William Gosset,Statistician
2,Alexander Flemming,Physician
3,Carl F. Gauss,Mathematician
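`iloc` slices by integer position and, like Python lists, leaves out the end position. Its label-based counterpart `loc` includes the end label. A brief sketch of the difference on a reduced copy of the table; `df`, `byPosition` and `byLabel` are names assumed for this example:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Rosaline Franklin', 'William Gosset',
                            'Alexander Flemming', 'Carl F. Gauss'],
                   'Occupation': ['Chemist', 'Statistician',
                                  'Physician', 'Mathematician'],
                   'Age': [37, 61, 73, 77]})

# Positions 1 and 2 for rows, 0 and 1 for columns (end excluded)
byPosition = df.iloc[1:3, 0:2]

# Row labels 1 through 3 and columns Name through Occupation (end included)
byLabel = df.loc[1:3, 'Name':'Occupation']

print(byPosition.shape)  # (2, 2)
print(byLabel.shape)     # (3, 2)
```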


# 3. Let's perform some basic operations

In [33]:
# we multiply the Price column by 2 and save the result as a new column
stocksDataFrame['PriceDoubled']=stocksDataFrame['Price']*2

In [34]:
stocksDataFrame

Unnamed: 0,Price,Company,PriceDoubled
0,3000,Google,6000
1,213,Microsoft,426
2,1250,Apple,2500
3,4500,Facebook,9000
4,100,Tesla,200


In [35]:
# we compute the natural log of the Price column
stocksDataFrame['PriceLog']=np.log(stocksDataFrame['Price'])

In [36]:
stocksDataFrame

Unnamed: 0,Price,Company,PriceDoubled,PriceLog
0,3000,Google,6000,8.006368
1,213,Microsoft,426,5.361292
2,1250,Apple,2500,7.130899
3,4500,Facebook,9000,8.411833
4,100,Tesla,200,4.60517
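Both operations above are applied to every element of the column at once, with no explicit loop. The sketch below repeats them on a standalone Series holding the same prices; `prices`, `doubled` and `logged` are illustrative names.

```python
import numpy as np
import pandas as pd

prices = pd.Series([3000, 213, 1250, 4500, 100])

doubled = prices * 2     # the scalar is applied to every element
logged = np.log(prices)  # NumPy functions also work element by element

print(doubled.tolist())           # [6000, 426, 2500, 9000, 200]
print(round(logged.iloc[4], 5))   # 4.60517

# Arithmetic between two Series pairs values up by index label
difference = doubled - prices
print(difference.tolist())        # [3000, 213, 1250, 4500, 100]
```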


# 4. Let's perform some basic operations on time variables

In [37]:
# we convert the Born and Died columns from strings to datetime64, one cell per column
DataFrameOfScientists['BornDateTimeFormat']=pd.to_datetime(DataFrameOfScientists['Born'])

In [38]:
DataFrameOfScientists['DiedDateTimeFormat']=pd.to_datetime(DataFrameOfScientists['Died'])

In [39]:
DataFrameOfScientists

Unnamed: 0,Name,Occupation,Born,Died,Age,BornDateTimeFormat,DiedDateTimeFormat
0,Rosaline Franklin,Chemist,1920-07-25,1958-04-16,37,1920-07-25,1958-04-16
1,William Gosset,Statistician,1876-06-13,1937-10-16,61,1876-06-13,1937-10-16
2,Alexander Flemming,Physician,1881-08-06,1954-03-11,73,1881-08-06,1954-03-11
3,Carl F. Gauss,Mathematician,1777-04-10,1855-02-23,77,1777-04-10,1855-02-23


In [41]:
DataFrameOfScientists['daysLived']=DataFrameOfScientists['DiedDateTimeFormat']-DataFrameOfScientists['BornDateTimeFormat']

In [42]:
DataFrameOfScientists

Unnamed: 0,Name,Occupation,Born,Died,Age,BornDateTimeFormat,DiedDateTimeFormat,daysLived
0,Rosaline Franklin,Chemist,1920-07-25,1958-04-16,37,1920-07-25,1958-04-16,13779 days
1,William Gosset,Statistician,1876-06-13,1937-10-16,61,1876-06-13,1937-10-16,22404 days
2,Alexander Flemming,Physician,1881-08-06,1954-03-11,73,1881-08-06,1954-03-11,26514 days
3,Carl F. Gauss,Mathematician,1777-04-10,1855-02-23,77,1777-04-10,1855-02-23,28442 days
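Subtracting one datetime column from another gives a timedelta column, shown above as a number of days. When plain integers are needed, the `.dt` accessor can extract them. A minimal sketch using the first two rows of the table; `born`, `died` and `lived` are names chosen for the example:

```python
import pandas as pd

born = pd.to_datetime(pd.Series(['1920-07-25', '1876-06-13']))
died = pd.to_datetime(pd.Series(['1958-04-16', '1937-10-16']))

lived = died - born            # a Series of timedeltas
daysAsInt = lived.dt.days      # whole days as integers
print(daysAsInt.tolist())      # [13779, 22404]

# Datetime columns expose their parts through the same accessor
print(born.dt.year.tolist())   # [1920, 1876]
```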


In [54]:
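# astype('timedelta64[Y]') is rejected by pandas 2.x, so we divide the day count by the
# mean Gregorian year length (365.2425 days) and round down to whole years
DataFrameOfScientists['yearsLived']=np.floor(DataFrameOfScientists['daysLived'].dt.days/365.2425)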

In [55]:
DataFrameOfScientists

Unnamed: 0,Name,Occupation,Born,Died,Age,BornDateTimeFormat,DiedDateTimeFormat,daysLived,yearsLived
0,Rosaline Franklin,Chemist,1920-07-25,1958-04-16,37,1920-07-25,1958-04-16,13779 days,37.0
1,William Gosset,Statistician,1876-06-13,1937-10-16,61,1876-06-13,1937-10-16,22404 days,61.0
2,Alexander Flemming,Physician,1881-08-06,1954-03-11,73,1881-08-06,1954-03-11,26514 days,72.0
3,Carl F. Gauss,Mathematician,1777-04-10,1855-02-23,77,1777-04-10,1855-02-23,28442 days,77.0


# 5. Let's export our data

## 5.1. Export as CSV

In [56]:
DataFrameOfScientists.to_csv('./DataFrameOfScientists.csv')
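By default `to_csv` also writes the row index as a first column with an empty header, which `read_csv` loads back as a column named `Unnamed: 0`. The sketch below compares the default with `index=False` on a small stand-in table; it produces the CSV text in memory instead of writing a file, and the names `df`, `withIndex`, `withoutIndex` and `roundTrip` are illustrative.

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Rosaline Franklin', 'William Gosset'],
                   'Age': [37, 61]})

# With no path given, to_csv returns the CSV text instead of writing a file
withIndex = df.to_csv()
withoutIndex = df.to_csv(index=False)

print(withIndex.splitlines()[0])     # ,Name,Age
print(withoutIndex.splitlines()[0])  # Name,Age

# Reading the index-less text back recovers the same columns
roundTrip = pd.read_csv(io.StringIO(withoutIndex))
print(list(roundTrip.columns))  # ['Name', 'Age']
```

Whether to keep the index depends on the data: a default 0, 1, 2, ... index carries no information, while a meaningful index (dates, identifiers) is usually worth saving.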