# Pandas Introduction to Data Structures

## 1. What is Pandas?

![image.png](attachment:image.png)

Python library for data analysis
* Provides fast and flexible data structures designed to work with tabular data
* Built on top of NumPy
* Part of the SciPy ecosystem (Scientific Computing Tools for Python)
    * Integrates well with other Python packages
    * SciPy & statsmodels, Matplotlib & Plotly, scikit-learn

### Key Pandas Features

* Intuitive data format
* Easy data transformations
* Data visualization
* Well suited to the typical data engineering task cycle
    * Munging, cleaning, analyzing and modeling data
    * Organizing results for visualization or tabular display

Import the Pandas package and check its version.
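
A minimal sketch of that step, assuming pandas is already installed in the active environment. The alias `pd` is the usual convention and is used in the rest of this notebook.

```python
import pandas as pd

# __version__ holds the installed pandas version as a string
print(pd.__version__)
```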

### Data Structure
- Pandas data structures have up to two dimensions.
- 1D objects are called Series.
- 2D objects are called DataFrames.
- A DataFrame is organized into rows and columns.

#### The basics
Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html

## 2. Pandas Series

Pandas Series is a one-dimensional array with axis labels.

### Constructing Series Objects

A Series object can be created using the following constructor, where data can be a list, a dictionary or another Series.

```
pandas.Series(data, index)
``` 

#### Create from List

When no index is specified, the Series is assigned a 0-based integer index.
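
As an illustration, the sketch below builds a Series from a plain list. The values are arbitrary sample data.

```python
import pandas as pd

# No index is given, so pandas assigns the default labels 0..4
s = pd.Series([100, 101, 102, 103, 104])

print(s)
print(list(s.index))  # [0, 1, 2, 3, 4]
```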

#### Specify Index

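An index can be specified for a Series, either by assigning to its `index` attribute after creation or by passing the `index` argument to the constructor. Both approaches below produce the same Series.

```
# Option 1: create with the default index, then replace it
s1 = pd.Series(range(100,105))
s1.index = list('abcde')

# Option 2: pass the index to the constructor
s2 = pd.Series(range(100,105), index=list('abcde'))
```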

### Selecting Items

Items in a Series can be selected by position, which supports both single item indexing and slicing.

Items in a Series can also be selected by label.
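
The sketch below shows both kinds of selection. It uses the explicit `iloc[]` (position) and `loc[]` (label) indexers, which is a choice made here to keep the two cases unambiguous; plain brackets such as `s['b']` also work for labels.

```python
import pandas as pd

s = pd.Series(range(100, 105), index=list('abcde'))

# Selection by position
first = s.iloc[0]        # single item: 100
middle = s.iloc[1:3]     # slice: positions 1 and 2, end position excluded

# Selection by label
by_label = s.loc['b']          # single item: 101
label_slice = s.loc['b':'d']   # label slices include the end label
```

Note the difference between the two slices: a positional slice excludes its end, while a label slice includes it.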

### Specialized Dictionary

A Series object is like a dictionary object, which maps keys (index) to values (data), but with the following differences:
* Items in a Series are ordered
* A Series has a fixed length
* Keys (index) in a Series don't have to be unique

In fact, a Series object can be created from a dictionary.
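
A short sketch of this, using a hypothetical `prices` dictionary as sample data:

```python
import pandas as pd

# Dictionary keys become the index, dictionary values become the data
prices = {'apple': 1.2, 'banana': 0.5, 'cherry': 3.0}
s = pd.Series(prices)

print(s['banana'])      # 0.5
print(list(s.index))    # ['apple', 'banana', 'cherry']
```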

### Filtering Data

Similar to a NumPy array, data in a Series can be filtered by boolean values.

Let's generate 10 random integers between 100 and 110 and use them to create a Series object.

Comparing the Series with a value creates a boolean Series that is `True` wherever the corresponding value is greater than 105.

Indexing the Series with this boolean Series keeps only the matching items. The following statement gives the same output in a single step.
```
s[s>105]
```
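
The steps above can be sketched as follows. The seeded NumPy generator is an assumption made here so that the example is repeatable; any source of random integers would do.

```python
import numpy as np
import pandas as pd

# integers() excludes the upper bound, so 111 gives values from 100 to 110
rng = np.random.default_rng(0)
s = pd.Series(rng.integers(100, 111, size=10))

mask = s > 105        # boolean Series, True where the value exceeds 105
filtered = s[mask]    # keeps only the items where mask is True

# The one-step form gives the same result
assert filtered.equals(s[s > 105])
```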

#### Filtering by Multiple Conditions

Multiple conditions can be combined using `&` (AND) and `|` (OR) operators.

* Find values in Series which can be divided by both 2 and 3.
* Find values in Series which can be divided by either 2 or 3.
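
A sketch of both queries on a small sample Series. Each condition is wrapped in parentheses because `&` and `|` bind more tightly than comparison operators such as `==`.

```python
import pandas as pd

s = pd.Series(range(1, 13))

both = s[(s % 2 == 0) & (s % 3 == 0)]      # divisible by 2 and by 3
either = s[(s % 2 == 0) | (s % 3 == 0)]    # divisible by 2 or by 3

print(list(both))    # [6, 12]
print(list(either))  # [2, 3, 4, 6, 8, 9, 10, 12]
```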

### Missing Data and Auto Alignment

Pandas can accommodate incomplete data. Missing data will have a value of `NaN`, i.e. Not-a-Number.

Data will be automatically aligned by their index values.

Create another Series from an existing Series object and specify a new index.
* An item whose index does not exist in the original Series is set to `NaN`.
* An item whose index does not exist in the new index is dropped.

For example:
* Items with index 'a' and 'e' are assigned `NaN`.
* The item with index 'g' is dropped.
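
One way to reproduce this behaviour is sketched below. The original Series `s1` and its values are assumptions chosen to match the labels mentioned above.

```python
import pandas as pd

# Hypothetical original Series with labels b, c, d and g
s1 = pd.Series([1, 2, 3, 4], index=list('bcdg'))

# Build a new Series from s1 with a different index
s2 = pd.Series(s1, index=list('abcde'))
print(s2)
```

Because `NaN` is a floating-point value, the integer data in `s2` is converted to floats.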

## 3. Pandas DataFrames

Pandas DataFrame is a 2-dimensional tabular data structure, which contains rows and columns.
* Columns can have different types.
* Columns can be added and removed.
* Rows and columns are indexed and can be labeled.

<img src='images/pandas_dataframe.png' width=500 />
*Reference: https://www.geeksforgeeks.org/python-pandas-dataframe/*

### Create DataFrame

A pandas DataFrame can be created using various inputs. All columns must be equal-length.
```
pandas.DataFrame(data, index, columns)
```

A DataFrame can be thought of as a dictionary of Series or lists that share a row index.
* Data is commonly passed in dictionary form, whose keys become the column labels

#### Create from Lists as Columns

When labels are not specified, both the row and column labels of a DataFrame default to integers starting from 0.
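
The sketch below passes two lists in dictionary form, using made-up sample data. The dictionary keys supply the column labels, while the rows receive the default integer index.

```python
import pandas as pd

names = ['Ann', 'Bob', 'Cat']
ages = [28, 35, 41]

# Keys become column labels; no row index is given, so rows are labeled 0, 1, 2
df = pd.DataFrame({'name': names, 'age': ages})

print(list(df.columns))  # ['name', 'age']
print(list(df.index))    # [0, 1, 2]
```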

#### Create from Series as Columns

A DataFrame can also be created from existing Series objects.
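
A minimal sketch with two hypothetical Series that share the same index:

```python
import pandas as pd

name = pd.Series(['Ann', 'Bob', 'Cat'], index=['a', 'b', 'c'])
age = pd.Series([28, 35, 41], index=['a', 'b', 'c'])

# The shared Series index becomes the row index of the DataFrame
df = pd.DataFrame({'name': name, 'age': age})
print(df)
```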

#### Create from 2D Lists as Rows

A DataFrame object can also be created from rows of data.

The rows are passed as a nested list, with one inner list per row.
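
For instance, with made-up sample rows and column labels supplied through the `columns` argument:

```python
import pandas as pd

# Each inner list is one row
rows = [['Ann', 28], ['Bob', 35], ['Cat', 41]]
df = pd.DataFrame(rows, columns=['name', 'age'])
print(df)
```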

### Select Column(s)

A column can be retrieved as a Series using either:
* dictionary notation
* attribute notation

To select multiple columns, pass a list of column labels using dictionary notation.
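
The sketch below shows the three forms on a small sample DataFrame. Attribute notation only works when the column label is a valid Python identifier that does not clash with an existing DataFrame attribute.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cat'],
    'age': [28, 35, 41],
    'city': ['Oslo', 'Rome', 'Lima'],
})

ages = df['age']                 # dictionary notation, returns a Series
ages_attr = df.age               # attribute notation, same Series
subset = df[['name', 'city']]    # list of labels, returns a DataFrame
```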

### Add Column(s)

New columns can be easily added
* direct assignment
* computation from other columns

*Note: Columns cannot be added using attribute notation!*
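
Both ways of adding a column are sketched below with hypothetical `price` and `qty` columns.

```python
import pandas as pd

df = pd.DataFrame({'price': [10.0, 20.0, 30.0], 'qty': [1, 2, 3]})

# Direct assignment of a list with one value per row
df['store'] = ['north', 'south', 'east']

# Computation from other columns, evaluated element by element
df['total'] = df['price'] * df['qty']

print(df)
```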

#### Add Column of Same Value

NumPy's broadcasting feature makes it easy to add a new column with the same value in every row.
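
A minimal sketch, assuming a hypothetical `currency` column:

```python
import pandas as pd

df = pd.DataFrame({'price': [10.0, 20.0, 30.0]})

# The single value is repeated for every row
df['currency'] = 'USD'

print(df)
```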

### Auto Alignment

A column can be added from a Series, in which case the Series index is automatically aligned with the DataFrame's row index.
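
The sketch below uses a hypothetical `discount` Series whose labels only partly match the DataFrame. Values are placed by label rather than by position.

```python
import pandas as pd

df = pd.DataFrame({'price': [10.0, 20.0, 30.0]}, index=['a', 'b', 'c'])

# Labels are in a different order, 'a' is missing and 'd' has no matching row
discount = pd.Series([0.1, 0.2, 0.3], index=['c', 'b', 'd'])

df['discount'] = discount
print(df)
```

Row 'a' receives `NaN` because the Series has no value for it, and the value labeled 'd' is ignored because the DataFrame has no such row.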

### Reindexing

Reindexing creates a new object with the data conformed to the new index.
* Rows whose labels are not in the new index are dropped.
* Labels in the new index that are not in the existing index get rows filled with `NaN`.
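
A short sketch using `reindex()` on sample data:

```python
import pandas as pd

df = pd.DataFrame({'price': [10.0, 20.0, 30.0]}, index=['a', 'b', 'c'])

# 'b' is left out of the new index, 'z' does not exist in the old one
df2 = df.reindex(['c', 'a', 'z'])
print(df2)
```

The original `df` is left unchanged; `df2` contains rows 'c' and 'a' with their data and a row 'z' filled with `NaN`.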

### Delete a Column

To delete a column, you can use the `pop()` or `drop()` method, but they behave differently.
* `pop()` modifies the DataFrame object directly and returns the removed column.
* `drop()` returns a new object, and you must specify `axis=1`, which refers to columns.
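
The difference can be seen in the sketch below, which uses made-up sample data.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Ann', 'Bob'],
    'age': [28, 35],
    'city': ['Oslo', 'Rome'],
})

# pop() removes the column from df itself and returns it as a Series
ages = df.pop('age')

# drop() leaves df untouched and returns a new DataFrame without the column
smaller = df.drop('city', axis=1)

print(list(df.columns))       # ['name', 'city']
print(list(smaller.columns))  # ['name']
```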

### Change Index Column
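
One common way to use a column as the row index, assumed here since this section does not name a method, is `set_index()`. The `id` column in the sketch is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'id': ['x1', 'x2', 'x3'], 'price': [10.0, 20.0, 30.0]})

# set_index() returns a new DataFrame that uses the 'id' column as row labels
indexed = df.set_index('id')

# reset_index() moves the index back into an ordinary column
restored = indexed.reset_index()

print(indexed)
```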


### Row Selection & Slicing

Rows can be selected using either `iloc[]` or `loc[]`.
* `iloc[]` accepts row positions
* `loc[]` accepts row labels
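
A sketch of single-row selection and slicing with both indexers, on sample data:

```python
import pandas as pd

df = pd.DataFrame({'price': [10.0, 20.0, 30.0, 40.0]},
                  index=['a', 'b', 'c', 'd'])

row_pos = df.iloc[0]         # first row, returned as a Series
rows_pos = df.iloc[1:3]      # positions 1 and 2, end position excluded

row_lab = df.loc['b']        # row labeled 'b', returned as a Series
rows_lab = df.loc['b':'d']   # rows 'b' through 'd', end label included
```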

### Add Rows

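New rows can be added by concatenating DataFrames with `pd.concat()`. The older `DataFrame.append()` method was deprecated in pandas 1.4 and removed in pandas 2.0.

```
df4 = pd.concat([df2, df3])
df4
```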

### Delete Row(s)

Rows can be deleted by label using the `drop()` method.
* By default, `drop()` uses `axis=0`, which refers to rows.
* `drop()` returns a new object.
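
For example, with a small sample DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'price': [10.0, 20.0, 30.0]}, index=['a', 'b', 'c'])

# axis=0 is the default, so a label alone refers to a row
without_b = df.drop('b')

# A list of labels removes several rows at once
only_b = df.drop(['a', 'c'])

print(list(without_b.index))  # ['a', 'c']
print(list(df.index))         # ['a', 'b', 'c'], the original is unchanged
```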