# III. The Series Data Structure
A Series is a one-dimensional labeled array, similar to a list or an array in Python. It holds a sequence of values (numeric or otherwise), such as a column of data, and pairs each value with an index label. A typical series has this form: <font color="red"><b>series = {index-0: value-0, index-1: value-1, index-2: value-2}</b></font>

To get started using a Series, the pandas library needs to be imported into the Python environment.

In [3]:
import pandas as pd

A series can be initialized from a list or a dictionary. The two methods differ only in how the index is set.<br/> (1) If the series is initialized from a list, its index defaults to the integers 0, 1, 2, and so on.<br/> (2) If the series is initialized from a dictionary, the keys of the dictionary become the index labels of the series.

## 3.1 Initialize series from a list
Based on an existing list, a series can be initialized by <font color="red"><b>series = pd.Series(list, index=[list of indices])</b></font>. Like a list, a series can hold data of different types. Note that the <b>index</b> parameter is optional. If it is not specified, the indices default to 0, 1, 2, and so on.

<font color="blue"><i>E.g.1 Initialize a series with a list of string elements without index specified. The resulting series carries string data.</i></font>

In [6]:
animals = ['Tiger', 'Bear', 'Moose']
animal_series = pd.Series(animals)
animal_series

0    Tiger
1     Bear
2    Moose
dtype: object

<font color="blue"><i>E.g.2 Initialize a series with the given list without index specified. The resulting series carries numerical data.</i></font>

In [7]:
numbers = [1, 2, 3]
num_series = pd.Series(numbers)
num_series

0    1
1    2
2    3
dtype: int64

<font color="blue"><i>E.g.3 Initialize a series with given indices. <b>Compare with the 1st example</b>. The value list and the index list must contain the same number of elements.</i></font>

In [12]:
s = pd.Series(['Tiger', 'Bear', 'Moose'], index=['India', 'America', 'Canada'])
s

India      Tiger
America     Bear
Canada     Moose
dtype: object
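
The length requirement mentioned in E.g.3 can be checked with a minimal sketch. The two lists below are made up for illustration, and the variable name `mismatch_error` is hypothetical. When the lengths differ, pandas raises a `ValueError` instead of building the series.

```python
import pandas as pd

values = ['Tiger', 'Bear']                 # two values
labels = ['India', 'America', 'Canada']    # three index labels

try:
    pd.Series(values, index=labels)
    mismatch_error = None
except ValueError as err:
    # pandas refuses to pair 2 values with 3 labels
    mismatch_error = err
    print('Could not build the series:', err)
```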

## 3.2 Initialize series from a dictionary
1. A series can also be initialized from an existing dictionary by <font color="red"><b>series = pd.Series(dict)</b></font>.<br/>
2. The indices of the series can be accessed through the <font color="red"><b>series.index</b></font> attribute.

<font color="blue"><i>E.g.1 Initialize a series from the given dictionary.</i></font>

In [5]:
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

<b>Extract the indices of the series.</b>

In [13]:
s.index

Index(['Archery', 'Golf', 'Sumo', 'Taekwondo'], dtype='object')
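
As a small follow-up sketch (the shortened dictionary and the variable names are illustrative assumptions), the index object can be converted to a plain Python list and tested for membership, and the values can be pulled out in the same way.

```python
import pandas as pd

s = pd.Series({'Archery': 'Bhutan', 'Golf': 'Scotland'})

labels = s.index.tolist()       # index labels as a plain Python list
has_golf = 'Golf' in s.index    # membership test against the labels
values = s.values.tolist()      # the data values, without the labels

print(labels)
print(has_golf)
print(values)
```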

## 3.3 Access series data
The data <b>values</b> in a series can be accessed by numerical position or by index label.<br/>
1. By numerical position: <font color="red"><b>series.iloc[numerical_id]</b></font>. Positions start from 0.<br/>
2. By index label: <font color="red"><b>series.loc[actual_id]</b></font>.<br/>

<font color="blue"><i>E.g.1 Extract the data of a given series by its numerical index and actual index.</i></font>

In [27]:
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
print(s.iloc[3])
print(s.loc['Golf'])

South Korea
Scotland
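
The difference between the two accessors matters most when the index labels are themselves integers. The following sketch uses a made-up series whose labels are 10, 20, and 30 to show that `iloc` counts positions while `loc` looks up labels.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[10, 20, 30])

by_position = s.iloc[0]   # first element by position
by_label = s.loc[10]      # element whose label is 10 (the same element here)

try:
    s.loc[0]              # there is no label 0, so this lookup fails
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_position, by_label, label_zero_found)
```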


## 3.4 Append new entries to series
1. Insert a single entry to an existing series: <font color="red"><b>series.loc[index] = value</b></font>.<br/>
2. Insert multiple entries to an existing series: <font color="red"><b>pd.concat([series1, series2])</b></font>. A new series needs to be initialized first to store the entries of data to be inserted. The older <b>series1.append(series2)</b> method was removed in pandas 2.0.

<font color="blue"><i>E.g.1 Insert a single entry to an existing series.</i></font>

In [9]:
s = pd.Series([1, 2, 3])
s.loc['Animal'] = 'Bears'
s

0             1
1             2
2             3
Animal    Bears
dtype: object
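
Note that `series.loc[index] = value` only adds an entry when the label is new. A minimal sketch with illustrative values: assigning to an existing label overwrites the stored value, while assigning to an unused label enlarges the series.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

s.loc[1] = 20   # label 1 already exists, so its value is overwritten
s.loc[3] = 4    # label 3 is new, so an entry is added at the end

print(s)
```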

<font color="blue"><i>E.g.2 Insert multiple entries of data to an existing series by appending a new series. Note that multiple entries in a series can share the same index label.</i></font>

In [20]:
original_sports = pd.Series({'Archery': 'Bhutan',
                             'Golf': 'Scotland',
                             'Sumo': 'Japan',
                             'Taekwondo': 'South Korea'})
cricket_loving_countries = pd.Series(['Australia', 'Barbados', 'Pakistan', 'England'], 
                                     index=['Cricket', 'Cricket', 'Cricket', 'Cricket'])
# Series.append was removed in pandas 2.0; pd.concat returns a new combined series
all_countries = pd.concat([original_sports, cricket_loving_countries])
all_countries

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
Cricket        Australia
Cricket         Barbados
Cricket         Pakistan
Cricket          England
dtype: object
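
Two consequences of combining series with repeated labels can be sketched as follows, using shortened, illustrative data and `pd.concat`. The original series is left unchanged, and looking up a repeated label with `loc` returns a series rather than a single value.

```python
import pandas as pd

original_sports = pd.Series({'Archery': 'Bhutan', 'Golf': 'Scotland'})
cricket = pd.Series(['Australia', 'Barbados'], index=['Cricket', 'Cricket'])

# pd.concat builds a new series; original_sports itself is not modified
all_countries = pd.concat([original_sports, cricket])

cricket_rows = all_countries.loc['Cricket']   # repeated label: a series comes back
golf_country = all_countries.loc['Golf']      # unique label: a single value comes back

print(cricket_rows)
print(golf_country)
```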

## 3.5 Miscellaneous series applications
1. Sum up a numerical series with the numpy library: <font color="red"><b>total = np.sum(series)</b></font>
2. Find the length of a series: <font color="red"><b>len(series)</b></font>.

<font color="blue"><i>E.g.1 Sum up a given series with numerical data. The NumPy library can be employed to avoid writing an explicit loop.</i></font>

In [18]:
import numpy as np

s = pd.Series([100.00, 120.00, 101.00, 3.00])
total = np.sum(s)
total

324.0

<font color="blue"><i>E.g.2 Add 2 to all the numerical data in the series.</i></font>

In [19]:
s += 2
s

0    102.0
1    122.0
2    103.0
3      5.0
dtype: float64
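
The length of a series, listed above but not yet demonstrated, can be sketched together with the built-in `sum()` method of the series, which gives the same total as `np.sum` here. The explicit loop is included only for comparison with the vectorized addition.

```python
import pandas as pd

s = pd.Series([100.00, 120.00, 101.00, 3.00])

n = len(s)          # number of entries in the series
total = s.sum()     # built-in alternative to np.sum(s)

# Loop version: add 2 to each value one at a time
looped = [value + 2 for value in s]

# Vectorized version: the addition is applied to every value at once
vectorized = s + 2

print(n, total)
print(vectorized.tolist() == looped)
```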

# IV. The DataFrame Data Structure
The dataframe is the most important data structure in analytics projects. In most cases, data imported from external sources such as spreadsheets is loaded as a dataframe. A dataframe can be considered an aggregation of many series of data. Therefore, the series manipulation techniques described above also apply to dataframes.

## 4.1 Initialize dataframe and access data in dataframe
Dataframes can be initialized with the <b>pd.DataFrame(data, index, <mark style="background-color: yellow;">columns</mark>)</b> command. Compared to a series, a dataframe also carries column names, which can be supplied through the <b>columns</b> argument.
1. Initialize dataframe from series: <font color="red"><b>df = pd.DataFrame([list_of_series], index = [list_of_indices])</b></font>. The series indices are taken as column names of the dataframe by default.
2. Without providing the "data" argument, an empty dataframe can be declared first for storing data in later use.
3. Access row data: <font color="red"><b>df.iloc[numerical_row_id, :]</b></font> or <font color="red"><b>df.loc[actual_id, :]</b></font>. A series is returned when a single row matches. If several rows share the label, <b>df.loc</b> returns a dataframe.
4. Access column data: <font color="red"><b>df.iloc[:, numerical_col_id]</b></font> or <font color="red"><b>df.loc[:, col_name]</b></font>. A series is returned.
5. Access dataframe cell: <font color="red"><b>df.iloc[numerical_row_id, numerical_col_id]</b></font> or <font color="red"><b>df.loc[actual_id, col_name]</b></font>.
6. Transpose a dataframe (swap its rows and columns): <font color="red"><b>df.T</b></font>.

<font color="blue"><i>E.g.1 Initialize a dataframe with given series and extract the data with row index and column index.</i></font>

In [22]:
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df

         Cost Item Purchased   Name
Store 1  22.5       Dog Food  Chris
Store 1   2.5   Kitty Litter  Kevyn
Store 2   5.0      Bird Seed  Vinod


<font color="blue"><b>Extract row data by row index</b>.</font>

In [43]:
print(df.iloc[0, :])
print(df.loc['Store 1',:])

Cost                  22.5
Item Purchased    Dog Food
Name                 Chris
Name: Store 1, dtype: object
         Cost Item Purchased   Name
Store 1  22.5       Dog Food  Chris
Store 1   2.5   Kitty Litter  Kevyn
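
The two outputs above have different shapes because 'Store 1' labels two rows. A minimal sketch on a reduced, illustrative version of the dataframe makes the returned types explicit.

```python
import pandas as pd

df = pd.DataFrame({'Cost': [22.5, 2.5, 5.0],
                   'Name': ['Chris', 'Kevyn', 'Vinod']},
                  index=['Store 1', 'Store 1', 'Store 2'])

first_row = df.iloc[0, :]        # one position, so a series
store_1 = df.loc['Store 1', :]   # label shared by two rows, so a dataframe
store_2 = df.loc['Store 2', :]   # label used once, so a series

print(type(first_row).__name__, type(store_1).__name__, type(store_2).__name__)
```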


<font color="blue"><b>Extract column data by column index.</b></font>

In [42]:
print(df.iloc[:, 0])
print(df.loc[:, 'Cost'])

Store 1    22.5
Store 1     2.5
Store 2     5.0
Name: Cost, dtype: float64
Store 1    22.5
Store 1     2.5
Store 2     5.0
Name: Cost, dtype: float64


<font color="blue"><b>Extract cell data by row and column indices.</b></font>

In [46]:
print(df.iloc[0, 0])
print(df.loc['Store 2', 'Name'])

22.5
Vinod
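
Declaring an empty dataframe and transposing a dataframe, both mentioned in the list at the top of this section, are not covered by the examples above. The following is a minimal sketch with made-up column names and values: an empty dataframe is declared first, filled column by column, and then transposed with `df.T`.

```python
import pandas as pd

df = pd.DataFrame()                 # empty dataframe, no data argument

# Fill the dataframe later, one column at a time
df['Name'] = ['Chris', 'Kevyn']
df['Cost'] = [22.5, 2.5]

transposed = df.T                   # rows become columns and vice versa

print(df)
print(transposed)
```

After the transpose, the former column names ('Name', 'Cost') serve as the row index, and the former row labels (0, 1) serve as the column names.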


## 4.2 DataFrame Manipulations
In this section, we walk through the following techniques of dataframe manipulation with a case study on the Olympics dataset.<br/>
1. Load data from a csv file: <font color="red"><b>df = pd.read_csv("data.csv", parameters)</b></font>.
2. Extract the column names of the dataframe: <font color="red"><b>df.columns</b></font>.
3. Rename <b>all</b> dataframe columns: <font color="red"><b>df.columns = [list_of_col_names]</b></font>. The number of column names should match the actual number of columns of the dataframe.
4. Rename <b>specific</b> dataframe columns: <font color="red"><b>df.rename(columns={"oldName1": "newName1", "oldName2": "newName2"}, inplace = True)</b></font>
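
The two renaming approaches in items 3 and 4 can be sketched on a small, hypothetical dataframe (the column names below are placeholders, not those of the Olympics dataset). Without `inplace=True`, `rename` returns a renamed copy and leaves the original untouched.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Rename all columns: the list length must equal the number of columns
df.columns = ['x', 'y', 'z']

# Rename specific columns in place; 'y' is left as it is
df.rename(columns={'x': 'first', 'z': 'last'}, inplace=True)

# Without inplace=True a new dataframe is returned instead
renamed_copy = df.rename(columns={'y': 'middle'})

print(list(df.columns))
print(list(renamed_copy.columns))
```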

<font color="blue"><b>Load the dataset from "olympics.csv" as a dataframe.</b> Set the first column as the index column of the dataframe, and skip the first row.</font>

In [48]:
df = pd.read_csv('olympics.csv', index_col = 0, skiprows=1)
df.head()

Unnamed: 0,№ Summer,01 !,02 !,03 !,Total,№ Winter,01 !.1,02 !.1,03 !.1,Total.1,№ Games,01 !.2,02 !.2,03 !.2,Combined total
Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12
Australasia (ANZ) [ANZ],2,3,4,5,12,0,0,0,0,0,2,3,4,5,12
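
The effect of `index_col=0` and `skiprows=1` can be tried without the actual file. The sketch below is an assumption-laden stand-in: it reads a tiny in-memory CSV, invented for illustration, through `io.StringIO` instead of "olympics.csv". The first line is skipped, the second line becomes the header, and the first column becomes the index.

```python
import io
import pandas as pd

# Hypothetical miniature file: line 1 is a junk banner row, line 2 is the real header
raw = (
    "banner,summer,summer\n"
    "country,games,gold\n"
    "Afghanistan (AFG),13,0\n"
    "Algeria (ALG),12,5\n"
)

df = pd.read_csv(io.StringIO(raw), index_col=0, skiprows=1)

print(df)
print(df.loc['Algeria (ALG)', 'gold'])
```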


<font color="blue">Extract the column names.</font>

In [50]:
df.columns

Index(['№ Summer', '01 !', '02 !', '03 !', 'Total', '№ Winter', '01 !.1',
       '02 !.1', '03 !.1', 'Total.1', '№ Games', '01 !.2', '02 !.2', '03 !.2',
       'Combined total'],
      dtype='object')

It can be observed that the columns with numerical names are not named properly! "01", "02", and "03" should be mapped to gold, silver, and bronze medals respectively. The columns of the dataframe are renamed with the following loop.<br/>
Explanation: <b>col[:2]</b> extracts the first two characters of a column name.