<center> <img src="https://github.ccs.neu.edu/caglar/DS3000/blob/master/img/ds3000.png?raw=true"> </center>

<center> <h1> Week 3 - Day 2 </h1> </center>

<center> <h2> Part 1: Data Processing with pandas</h2></center>

## Outline
1. <a href='#1'>pandas</a>
2. <a href='#2'>pandas **`Series`**</a>
3. <a href='#3'>Creating a **`Series`** with Custom Indices</a>
4. <a href='#4'>Initializing **`Series`** with Dictionaries</a>
5. <a href='#5'>Iterating through a Series object</a>
6. <a href='#6'>Series Methods</a>



<a id="1"></a>

## 1. pandas
* NumPy’s `array` is optimized for homogeneous numeric data that’s accessed via integer indices
* Big data applications must support mixed data types, customized indexing, missing data, inconsistently structured data, and data that must be reshaped into forms appropriate for the databases and data analysis packages you use
* **pandas** is the most popular library for dealing with such data

* Two key collections 
    * **`Series`** for one-dimensional collections 
    * **`DataFrame`** for two-dimensional collections

* NumPy and pandas are intimately related
    * **`Series`** and **`DataFrame`**s use **`array`**s “under the hood”
    * **`Series`** and **`DataFrame`**s are valid arguments to many NumPy operations
    * **`array`**s are valid arguments to many **`Series`** and **`DataFrame`** operations
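
The relationship between the two libraries can be sketched in a few lines. This is a minimal illustration, not part of the original lecture cells; the variable names and values (`temps`, `roots`, `shifted`) are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so the square roots are easy to check
temps = pd.Series([1.0, 4.0, 9.0])

# A Series stores its data in a NumPy array; to_numpy() exposes it
underlying = temps.to_numpy()
print(type(underlying))

# A NumPy universal function accepts a Series and returns a Series
roots = np.sqrt(temps)
print(roots)

# A NumPy array can be an operand in Series arithmetic (element-wise)
shifted = temps + np.array([10.0, 20.0, 30.0])
print(shifted)
```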

<a id="2"></a>

## 2. pandas **`Series`**
* An enhanced one-dimensional `array`
* Supports custom indexing, including even non-integer indices like strings
* Offers additional capabilities that make it more convenient for many data-science tasks
    * `Series` may have missing data
    * Many `Series` operations ignore missing data by default
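
As a small sketch of how missing data behaves, the example below uses `np.nan` as the missing-value marker. The `scores` values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# The middle score is missing
scores = pd.Series([87.0, np.nan, 91.0])

print(scores.count())             # 2: count() ignores the missing value
print(scores.mean())              # 89.0: the mean of 87.0 and 91.0 only
print(scores.mean(skipna=False))  # nan: skipping can be turned off
```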

In [1]:
import pandas as pd  # pd is the conventional alias for the library

### 2.1. Define a `Series` with Default Indices
* Use the **pd.Series(data, index=index)** format
  * `index` is an optional argument, and `data` can be, among other things, a list, a NumPy `array`, a dictionary or a single scalar value.
* By default, a `Series` has integer indices numbered sequentially from 0

In [2]:
grades = pd.Series([87, 100, 91])

In [3]:
grades

0     87
1    100
2     91
dtype: int64

In [4]:
grades[1]

100

In [5]:
grades.values

array([ 87, 100,  91], dtype=int64)

In [6]:
grades.index

RangeIndex(start=0, stop=3, step=1)

<a id="3"></a>

## 3. Creating a `Series` with Custom Indices
* Can specify custom indices with the `index` keyword argument


In [7]:
import numpy as np

grades_array = np.array([87, 100, 91])

In [8]:
grades = pd.Series(grades_array, index = ['Harry', 'Hermione', 'Ron'])

In [9]:
grades

Harry        87
Hermione    100
Ron          91
dtype: int32

### 3.1. Accessing Elements of a `Series` Via Custom Indices
* Can access individual elements via square brackets containing a custom index value
* Use the **Series_Name[index_value]** notation, similar to looking up a key in a dictionary

In [10]:
grades

Harry        87
Hermione    100
Ron          91
dtype: int32

In [11]:
grades["Hermione"]

100

<center> <img src="https://thumbs.gfycat.com/DarlingAmusingGangesdolphin-size_restricted.gif"> </center>

In [12]:
grades["Hermione"] -= 5

In [13]:
grades

Harry       87
Hermione    95
Ron         91
dtype: int32

* If custom indices are strings that are valid Python identifiers, pandas also lets you access the corresponding elements as attributes of the `Series`
* This works only when the label does not clash with the name of an existing `Series` attribute or method; square brackets always work

In [14]:
grades.Hermione

95
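
Attribute access has one pitfall worth seeing in code: a label that matches an existing `Series` method name is not reachable as an attribute. The sketch below assumes a hypothetical `marks` series in which one label is deliberately named `max`.

```python
import pandas as pd

# "max" is a deliberately awkward label: it is also a Series method name
marks = pd.Series([87, 95, 91], index=["Harry", "Hermione", "max"])

print(marks.Hermione)        # 95: the label is a valid identifier with no clash
print(callable(marks.max))   # True: this is the max() method, not the element
print(marks["max"])          # 91: square brackets always reach the element
print(marks.max())           # 95: the largest value in the Series
```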

<a id="4"></a>

## 4. Initializing Series with Dictionaries
* If you initialize a `Series` with a dictionary, its keys become the indices, and its values become the `Series`’ element values

In [None]:
student_dict = {"Harry": 85, "Hermione": 95, "Ron": 91}

In [None]:
grades = pd.Series(student_dict)
grades
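
A dictionary can also be combined with the `index` argument. In that case pandas selects values by key in the order given, and a label with no matching key becomes a missing value. The label `"Neville"` below is hypothetical and was added only to show that behavior.

```python
import pandas as pd

student_dict = {"Harry": 85, "Hermione": 95, "Ron": 91}

# Select and reorder by label; "Neville" is not a key in the dictionary
subset = pd.Series(student_dict, index=["Ron", "Harry", "Neville"])
print(subset)

print(subset["Ron"])               # 91.0
print(pd.isna(subset["Neville"]))  # True: no matching key, so the value is NaN
```

The values are shown as floats because `NaN` is a floating-point value, so the whole `Series` is stored as `float64`.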

<a id="5"></a>

## 5. Iterating through a Series object
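* Similar to dictionaries for the most part, with one difference: iterating over a `Series` directly yields its values, whereas iterating over a dictionary yields its keys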

In [None]:
grades

In [None]:
for grade in grades:
    print(grade, end="  ")

In [None]:
for key in grades.keys():
    print(key, end="  ")

In [None]:
for key, item in grades.items():
    print(key, "got", item)
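
To make the comparison with dictionaries concrete, here is a short side-by-side sketch. It rebuilds `grades` from the same dictionary used earlier so that it runs on its own.

```python
import pandas as pd

student_dict = {"Harry": 85, "Hermione": 95, "Ron": 91}
grades = pd.Series(student_dict)

print(list(student_dict))   # ['Harry', 'Hermione', 'Ron']: a dict iterates over keys
print(list(grades))         # [85, 95, 91]: a Series iterates over values
print(list(grades.keys()))  # ['Harry', 'Hermione', 'Ron']: keys() returns the index

# Membership tests behave like a dictionary: they check the index labels
print("Ron" in grades)      # True
print(91 in grades)         # False: 91 is a value, not a label
```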

<a id="6"></a>

  
## 6. Series Methods
* `Series` provides many methods for common tasks including producing various descriptive statistics

In [None]:
grades.count()

In [None]:
grades.mean()

In [None]:
grades.min()

In [None]:
grades.max()

In [None]:
grades.std()

#### describe() method
* `Series` method **`describe`** produces all these stats and more
* The `25%`, `50%` and `75%` are **quartiles**:
    * `50%` represents the median of the sorted values.
    * `25%` is the value below which roughly a quarter of the sorted values fall.
    * `75%` is the value below which roughly three quarters of the sorted values fall.
* When a quartile falls between two elements, pandas by default interpolates linearly between them, so its results can differ from the "median of each half" method used in some textbooks
* We'll cover descriptive statistics later in the semester!

In [None]:
grades.describe()
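
The quartile values can be checked by hand for the three grades used in this section. The sketch below rebuilds `grades` so that it runs on its own; the arithmetic in the comments assumes the default linear interpolation.

```python
import pandas as pd

grades = pd.Series({"Harry": 85, "Hermione": 95, "Ron": 91})
summary = grades.describe()

# Sorted values: 85, 91, 95
print(summary["25%"])  # 88.0: halfway between 85 and 91
print(summary["50%"])  # 91.0: the middle value
print(summary["75%"])  # 93.0: halfway between 91 and 95

# quantile() gives the same numbers one at a time
print(grades.quantile(0.25))  # 88.0
```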