## Pandas
- Pandas is an open-source Python library designed for data analysis tasks.
- Looking at the pandas source code confirms that the library is built on top of the NumPy library.
- Pandas library provides two important data structures named:
    - [`pandas.Series:`](https://github.com/pandas-dev/pandas/blob/v2.0.3/pandas/core/series.py#L244-L6112)
        - Series is a one-dimensional ndarray with axis labels.
        - You can think of Series as a Single Column or Single Row of data with associated labels.

    - [`pandas.DataFrame:`](https://github.com/pandas-dev/pandas/blob/v2.0.3/pandas/core/frame.py#L490-L11586)
        - A DataFrame is a two-dimensional labeled data structure with columns that can hold data of different types.
        - It is similar to a table in a relational database or a spreadsheet with rows and columns.
        - Each column in a DataFrame can be viewed as a Series.
        - A DataFrame can be created from dictionaries, arrays, lists of dictionaries, CSV files, Excel files, and more.
        - The Pandas DataFrame class provides methods and attributes that make a data analyst's work easier.

In addition to these data structures, Pandas also provides tools for data ingestion, data manipulation, data merging, grouping, and so on.
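
To make the relationship between the two structures concrete, the sketch below selects one column from a small DataFrame and checks that it is a Series. The column names and values are made up for illustration.

```python
import pandas as pd

# A small illustrative DataFrame (values are invented for this sketch)
df = pd.DataFrame({'Name': ['Ram', 'Gita', 'Hari'], 'Age': [10, 25, 20]})

# Selecting a single column returns a Series
age = df['Age']
print(type(age))

# The Series carries the column name and shares the DataFrame's row index
print(age.name)                      # Age
print(age.index.equals(df.index))    # True
```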


## Install Pandas and Numpy
- Before using pandas, make sure the pandas library is installed on your system.
- If you are using Google Colab, there is no need to install pandas because it comes preinstalled.
- In addition to Pandas, we will also use the NumPy library, so let's install it as well (skip this if you are using Google Colab).
- Uncomment the lines in the cell below to install the necessary libraries.

In [3]:
# !pip install pandas
# !pip install numpy

## Import Libraries
- To use the installed libraries (numpy and pandas), we need to import them with the `import` keyword.
- We will refer to:
    - numpy via the alias `np`
    - pandas via the alias `pd`

In [4]:
import numpy as np
import pandas as pd

## Creating Series
- There are several ways to create a Series:
    1. From List
    2. From Dictionary
    3. From Numpy Array

- To create a Series, you can use the `pandas.Series()` constructor.

**1. Create a Series from a Python List**

In [7]:
# define list
data = [10, 20, 30, 40, 50]

# create series from list
series_list = pd.Series(data)

print(series_list)

0    10
1    20
2    30
3    40
4    50
dtype: int64


In [6]:
# check type
print(type(series_list))

<class 'pandas.core.series.Series'>


**2. Create a Series from a Dictionary**

In [10]:
# define dictionary
data_dict = {
    'a': 10,
    'b': 20,
    'c': 30,
    'd': 40
}

# create series from dictionary
series_dict = pd.Series(data_dict)

print(series_dict)

a    10
b    20
c    30
d    40
dtype: int64


In [9]:
# check type
print(type(series_dict))

<class 'pandas.core.series.Series'>


**3. Create a Series from a NumPy Array**

In [11]:
# define array
numpy_array = np.array([10, 20, 30, 40])

# create series from numpy array
series_np = pd.Series(numpy_array)

print(series_np)

0    10
1    20
2    30
3    40
dtype: int64


- **Note:**
    - When creating a `Series` from a list or a NumPy array, the index labels are generated automatically, starting from 0.
    - When creating a `Series` from a dictionary, the dictionary keys become the index labels and the dictionary values become the data in the Series.
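
The note above can be checked with a short sketch. It also uses the `index` parameter of `pd.Series()`, which is not covered in the cells above, to attach explicit labels to list data; the variable names are chosen for this example only.

```python
import pandas as pd

# List input: labels default to 0, 1, 2, ...
s_default = pd.Series([10, 20, 30])

# List input with explicit labels passed through the `index` parameter
s_labeled = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Dictionary input: keys become the labels, values become the data
s_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})

print(list(s_default.index))   # [0, 1, 2]
print(list(s_labeled.index))   # ['a', 'b', 'c']
print(s_dict['b'])             # 20
```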

## Create DataFrame
- Several ways to create a Pandas DataFrame are:
    1. Create DataFrame from 2D List
    2. Create DataFrame from a Dictionary of Lists
    3. Create DataFrame from a List of Dictionaries
    4. Create DataFrame from a CSV file
    5. Create DataFrame from an Excel file

- To create a DataFrame from in-memory data, you can use the `pd.DataFrame()` constructor; files are read with `pd.read_csv()` and `pd.read_excel()`.

**1. Create DataFrame from 2D List**

In [12]:
# Define list
data = [[10, 20], [30, 40], [50, 60]]

# create DataFrame
df_list = pd.DataFrame(data, columns=['col1', 'col2'])

df_list

|   | col1 | col2 |
|---|------|------|
| 0 | 10 | 20 |
| 1 | 30 | 40 |
| 2 | 50 | 60 |


**2. Create DataFrame from a Dictionary of Lists**
- You can create a DataFrame from a dictionary where:
    - each `key` represents a column name.
    - each `value` is a list containing the data for that column.

In [13]:
# define dictionary of lists
data_dict = {
    'Name': ['Ram', 'Gita', 'Hari'],
    'Age': [10, 25, 20]
}

# create dataframe
df_dict = pd.DataFrame(data_dict)

df_dict

|   | Name | Age |
|---|------|-----|
| 0 | Ram | 10 |
| 1 | Gita | 25 |
| 2 | Hari | 20 |


**3. Create DataFrame from a List of Dictionaries**
- You can also create a DataFrame from a list of dictionaries, where
    - each dictionary corresponds to a row and contains column-value pairs.

In [14]:
# define data list
data_list = [
    {'Name': 'Ram', 'Age': 10},
    {'Name': 'Gita', 'Age': 25},
    {'Name': 'Hari', 'Age': 20}
]

# Create DataFrame
df_list = pd.DataFrame(data_list)

df_list

|   | Name | Age |
|---|------|-----|
| 0 | Ram | 10 |
| 1 | Gita | 25 |
| 2 | Hari | 20 |
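
Because each dictionary is treated as one row, the dictionaries do not need identical keys. The sketch below, using made-up records, shows that a key missing from a row is filled with `NaN` in the resulting DataFrame.

```python
import pandas as pd

# Illustrative records with uneven keys (invented for this sketch)
records = [
    {'Name': 'Ram', 'Age': 10},
    {'Name': 'Gita'},                                  # no 'Age' key
    {'Name': 'Hari', 'Age': 20, 'City': 'Pokhara'},    # extra 'City' key
]

df_records = pd.DataFrame(records)

# Columns are the union of all keys; missing entries become NaN
print(df_records)
print(df_records.isna().sum())
```

Note that the `Age` column becomes a float column here, since `NaN` is a floating-point value.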


## Data Ingestion
- Data ingestion is the process of obtaining and importing data for immediate use or storage in a database.
- This repository includes a variety of datasets, so for now we will use those as the data source.
- There may be scenarios where we need to ingest data from Google Drive as well; for that purpose, we can use the `gdown` library.

**4. Create DataFrame from a CSV file**
- Use the `pd.read_csv()` function.

In [24]:
# define path
data_path = '../../Data/HR_comma_sep.csv'

# read data from csv file
df_hr = pd.read_csv(data_path)

df_hr

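|   | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | Department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14994 | 0.40 | 0.57 | 2 | 151 | 3 | 0 | 1 | 0 | support | low |
| 14995 | 0.37 | 0.48 | 2 | 160 | 3 | 0 | 1 | 0 | support | low |
| 14996 | 0.37 | 0.53 | 2 | 143 | 3 | 0 | 1 | 0 | support | low |
| 14997 | 0.11 | 0.96 | 6 | 280 | 4 | 0 | 1 | 0 | support | low |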
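
The cell above depends on a file path inside this repository. As a sketch that runs anywhere, the example below feeds `pd.read_csv()` an in-memory string holding a few columns of the first rows shown above, and then uses the `usecols` and `nrows` parameters to load only part of the data. The variable names are chosen for this example.

```python
import io
import pandas as pd

# A few columns from the first rows of the HR data shown above,
# inlined so the sketch does not need the CSV file on disk
csv_text = """satisfaction_level,last_evaluation,number_project,Department,salary
0.38,0.53,2,sales,low
0.80,0.86,5,sales,medium
0.11,0.88,7,sales,medium
"""

# read_csv accepts a file path or any file-like object
df_sample = pd.read_csv(io.StringIO(csv_text))
print(df_sample.shape)   # (3, 5)

# usecols keeps only the listed columns, nrows limits the number of rows read
df_small = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['satisfaction_level', 'salary'],
    nrows=2,
)
print(df_small)
```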


**5. Create DataFrame from an Excel file**
- Use the `pd.read_excel()` function.

In [None]:
# write your code here

## Viewing Data
- In this section we will explore the methods and attributes available for viewing a loaded DataFrame object.
- **Methods and attributes are:**
    - head() and tail()
    - shape
    - index and columns
    - dtypes
    - info()  
    - describe()
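
A minimal sketch of these calls on a small made-up DataFrame follows. Note that `shape`, `index`, `columns`, and `dtypes` are attributes and are used without parentheses.

```python
import pandas as pd

# Illustrative data (invented for this sketch)
df = pd.DataFrame({
    'Name': ['Ram', 'Gita', 'Hari', 'Sita'],
    'Age': [10, 25, 20, 30],
    'Score': [81.5, 92.0, 77.25, 88.0],
})

print(df.head(2))      # first 2 rows (default is 5)
print(df.tail(2))      # last 2 rows (default is 5)
print(df.shape)        # (4, 3): number of rows, number of columns
print(df.index)        # row labels
print(df.columns)      # column labels
print(df.dtypes)       # data type of each column
df.info()              # prints non-null counts, dtypes, and memory usage
print(df.describe())   # summary statistics for the numeric columns
```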

## Data Selection
- **Methods are:**
    - select columns (using [], and dot(.) operator)
    - loc and iloc
    - unique() and nunique()
    - value_counts()
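
The sketch below walks through these selection tools on made-up data. It assumes custom row labels (`r1` to `r4`) so that the difference between label-based `loc` and position-based `iloc` is visible.

```python
import pandas as pd

# Illustrative data with explicit row labels (invented for this sketch)
df = pd.DataFrame(
    {
        'Name': ['Ram', 'Gita', 'Hari', 'Sita'],
        'Department': ['sales', 'support', 'sales', 'hr'],
        'Age': [10, 25, 20, 30],
    },
    index=['r1', 'r2', 'r3', 'r4'],
)

ages = df['Age']               # one column with [] returns a Series
ages_dot = df.Age              # dot access works when the name is a valid identifier
subset = df[['Name', 'Age']]   # a list of columns returns a DataFrame

by_label = df.loc['r2', 'Name']   # loc selects by label -> 'Gita'
by_position = df.iloc[1, 0]       # iloc selects by integer position -> 'Gita'
label_slice = df.loc['r1':'r2']   # label slices include both ends (2 rows)
position_slice = df.iloc[0:2]     # position slices exclude the end (2 rows)

print(df['Department'].unique())         # distinct values in order of appearance
print(df['Department'].nunique())        # 3
print(df['Department'].value_counts())   # count of rows per distinct value
```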

## Aggregations
- **Methods are:**
    - mean()
    - std()
    - sum()
    - max()
    - min()
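
As a quick sketch, the aggregation methods can be called on a single column or on the whole DataFrame. The data below is made up so that the results are easy to verify by hand.

```python
import pandas as pd

# Illustrative numeric data (invented for this sketch)
df = pd.DataFrame({'Age': [10, 25, 20], 'Score': [80.0, 90.0, 70.0]})

print(df['Age'].sum())      # 55
print(df['Age'].max())      # 25
print(df['Age'].min())      # 10
print(df['Score'].mean())   # 80.0
print(df['Score'].std())    # 10.0 (sample standard deviation, ddof=1 by default)

# Called on the DataFrame, the same methods aggregate each column
print(df.mean())
```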

## Missing Data
- **Methods are:**
    - isnull() or isna()
    - notnull() or notna()
    - dropna()
    - fillna()
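
The following sketch applies these methods to a made-up DataFrame with one missing value. Filling with the column mean is only one possible strategy, chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing Age (invented for this sketch)
df = pd.DataFrame({'Name': ['Ram', 'Gita', 'Hari'], 'Age': [10, np.nan, 20]})

print(df.isna())           # True where a value is missing; isnull() is an alias
print(df['Age'].notna())   # True where a value is present; notnull() is an alias

# dropna() returns a new DataFrame without the rows that contain missing values
dropped = df.dropna()

# fillna() returns a new DataFrame with missing values replaced;
# here the missing Age is replaced by the mean of the known ages (15.0)
filled = df.fillna({'Age': df['Age'].mean()})

print(dropped)
print(filled)
```

Both `dropna()` and `fillna()` leave the original `df` unchanged unless the result is assigned back.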

## Data Manipulation
- **Methods are:**
    - replace()
    - astype()
    - drop()
    - sort_values()
    - groupby()
    - apply()
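
A short sketch of each manipulation method follows. The data, the `Age_group` column, and the age threshold are assumptions made for this example.

```python
import pandas as pd

# Illustrative data; Age is stored as strings on purpose (invented for this sketch)
df = pd.DataFrame({
    'Name': ['Ram', 'Gita', 'Hari', 'Sita'],
    'Department': ['sales', 'support', 'sales', 'support'],
    'salary': ['low', 'medium', 'low', 'high'],
    'Age': ['10', '25', '20', '30'],
})

# astype(): convert the string column to integers
df['Age'] = df['Age'].astype(int)

# replace(): swap one value for another
df['salary'] = df['salary'].replace('medium', 'mid')

# apply(): run a function on every value of a column
df['Age_group'] = df['Age'].apply(lambda age: 'adult' if age >= 18 else 'minor')

# drop(): return a copy without the listed columns
trimmed = df.drop(columns=['Name'])

# sort_values(): order rows by a column, largest first here
sorted_df = df.sort_values('Age', ascending=False)

# groupby(): split rows by Department, then aggregate Age within each group
mean_age = df.groupby('Department')['Age'].mean()
print(mean_age)   # sales 15.0, support 27.5
```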
    

## Data Join
- **Types are:**
    1. Inner Join
        - `pd.merge(df1, df2, on='col_name', how='inner')`
    2. Left Outer Join
        - `pd.merge(df1, df2, on='col_name', how='left')`
    3. Right Outer Join
        - `pd.merge(df1, df2, on='col_name', how='right')`
    4. Full Outer Join
        - `pd.merge(df1, df2, on='col_name', how='outer')`
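
To see how the four join types differ, the sketch below merges two small made-up tables on a hypothetical `emp_id` column. Only ids 2 and 3 appear in both tables.

```python
import pandas as pd

# Illustrative tables sharing a key column (invented for this sketch)
df1 = pd.DataFrame({'emp_id': [1, 2, 3], 'Name': ['Ram', 'Gita', 'Hari']})
df2 = pd.DataFrame({'emp_id': [2, 3, 4], 'Department': ['sales', 'support', 'hr']})

# Inner: only keys present in both tables (2, 3)
inner = pd.merge(df1, df2, on='emp_id', how='inner')

# Left: every key from df1 (1, 2, 3); Department is NaN for emp_id 1
left = pd.merge(df1, df2, on='emp_id', how='left')

# Right: every key from df2 (2, 3, 4); Name is NaN for emp_id 4
right = pd.merge(df1, df2, on='emp_id', how='right')

# Outer: every key from either table (1, 2, 3, 4)
outer = pd.merge(df1, df2, on='emp_id', how='outer')

for name, result in [('inner', inner), ('left', left), ('right', right), ('outer', outer)]:
    print(name, len(result))
```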

## Save Data
- As with reading, Pandas provides several functions to save data to various file formats such as CSV, Excel, JSON, and so on.
- **Save CSV**
    - `df.to_csv(csv_file_path, index=False)`
        - `csv_file_path`: path where the CSV file will be saved, e.g. `test.csv`
        - `index=False`: do not write the row index to the file.

- **Save Excel**
    - `df.to_excel(excel_file_path, index=False)`
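
The effect of `index=False` is easiest to see by writing a file and reading it back. The sketch below does this in a temporary directory so that it leaves no files behind; the file names and data are made up for the example.

```python
import os
import tempfile

import pandas as pd

# Illustrative data (invented for this sketch)
df = pd.DataFrame({'Name': ['Ram', 'Gita'], 'Age': [10, 25]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Save without the index, then read the file back
    csv_file_path = os.path.join(tmp_dir, 'test.csv')
    df.to_csv(csv_file_path, index=False)
    df_back = pd.read_csv(csv_file_path)

    # Save with the index (the default), then read the file back
    with_index_path = os.path.join(tmp_dir, 'with_index.csv')
    df.to_csv(with_index_path)
    df_with_index = pd.read_csv(with_index_path)

print(df_back.columns.tolist())         # ['Name', 'Age']
print(df_with_index.columns.tolist())   # ['Unnamed: 0', 'Name', 'Age']
```

When the index is written, reading the file back produces an extra `Unnamed: 0` column, which is why `index=False` is often preferred. `df.to_excel()` follows the same pattern but needs an Excel engine such as `openpyxl` to be installed.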

In [25]:
# write your program here