# The Data Scientist’s Toolkit
## Data Analysis with Pandas + Matplotlib

## Today

- Recap of `Python + Pandas for Real-World Data`
- Data Cleaning
- Grouping
- Visualization


# Download the Snippets or Run Them Online

- [Github Repository](https://github.com/skilldisk/Python-Pandas-for-Real-World-Data-Data-Wrangling-Made-Easy.git)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/skilldisk/Python-Pandas-for-Real-World-Data-Data-Wrangling-Made-Easy/main?urlpath=%2Fdoc%2Ftree%2Fjune_22.ipynb)

# Recap

# Pandas Data Structure

- Series

- DataFrame

# Creating Pandas Data Structures

## Series

- list
- tuple
- dictionary

In [None]:
import pandas as pd

In [None]:
even = [2, 4, 6, 8, 10, 12]

series_data = pd.Series(even)
series_data

In [None]:
series_data.name = "even"

In [None]:
series_data.info()  # call the method; without () only the bound method is displayed

In [None]:
student = {'name':"Kishore", 'age':20, 'place':"Bengaluru", 'email':"kishore@email.com"}
student_data = pd.Series(student)
student_data

In [None]:
student_data.info()  # call the method; without () only the bound method is displayed
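The list above also mentions tuples, which the cells so far do not show. A minimal sketch, using made-up values (`odd` and `marks` are illustrative names, not part of the workshop data):

```python
import pandas as pd

# A tuple behaves like a list: the values get a default 0..n-1 index.
odd = pd.Series((1, 3, 5, 7), name="odd")

# With a dictionary, the keys become the index labels.
marks = pd.Series({"maths": 82, "physics": 74})  # made-up values

print(odd.index.tolist())    # default positional index
print(marks.index.tolist())  # the dictionary keys
print(marks["physics"])      # label-based lookup
```

Passing `name=` to the constructor is equivalent to setting `series.name` afterwards.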

# DataFrame

In [None]:
student = {
    "name": ['Kishore', 'Vinay', 'Adithya', 'Kumar', 'Vijaya'],
    "place": ['Bengaluru', 'Mysore', 'Bengaluru', 'Tumakuru', 'Mysore'],
    "age": [21, 23, 20, 21, 19]
}

student_df = pd.DataFrame(student)
student_df

In [None]:
student_df.index

In [None]:
student_df.columns

In [None]:
student_df.name

# Reading the files

In [None]:
data = pd.read_csv('./data/clean_data_crop.csv')

# Inspecting


Check what the data looks like, how many rows and columns there are, and how much data we have.

### How many rows and columns are there?

In [None]:
data.shape

### Column names

In [None]:
data.columns

### Data Types

In [None]:
data.dtypes

### What does the data look like?

In [None]:
data.head()

In [None]:
data

### Information on DataFrames

In [None]:
data.info()

## Selecting Columns

In [None]:
data.State

In [None]:
data['State']

## Multiple selection

Create another DataFrame with only the *State*, *Year*, and *Rice* columns.

In [None]:
data[['State','Year', 'Rice']]

## Selecting Rows

Selecting rows 11 to 20

In [None]:
data[11:21]

## Indexing

- `iloc[]` selects by position
- `loc[]` selects by label (name)

In [None]:
data.iloc[10:21]

In [None]:
data.iloc[30:35, 1:3]

In [None]:
data.loc[30:35, 'State':'Wheat']
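One detail that is easy to miss: an `iloc[]` slice excludes its end position, while a `loc[]` slice includes its end label. A small sketch on a made-up DataFrame (the values are illustrative only):

```python
import pandas as pd

# Tiny made-up frame with the default integer index 0..4.
df = pd.DataFrame(
    {"State": ["A", "B", "C", "D", "E"], "Rice": [10, 20, 30, 40, 50]}
)

by_position = df.iloc[1:3]          # positions 1 and 2 (end excluded)
by_label = df.loc[1:3]              # labels 1, 2 and 3 (end included)
cols = df.loc[1:3, "State":"Rice"]  # column label slices are inclusive too

print(len(by_position), len(by_label))
```

With the default integer index the labels and positions look the same, which is why the two slices are easy to confuse.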

## Filtering

Using a **Boolean mask**

In [None]:
data['State'] == 'Karnataka'

In [None]:
assam_mask = (data['State'] == 'Assam') & (data['Year']== '2022-23 ')
data[assam_mask]

In [None]:
data[data['Rice']>15500][['State', 'Year', 'Rice']]
# data[data['Rice']>15000]      
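The `Year` comparison above only matches because the string includes a trailing space. A hedged sketch of a more forgiving approach, stripping whitespace before comparing (the frame below is made up for illustration):

```python
import pandas as pd

# Made-up rows; note the trailing space in the Year strings.
df = pd.DataFrame({
    "State": ["Assam", "Assam", "Karnataka"],
    "Year": ["2021-22 ", "2022-23 ", "2022-23 "],
    "Rice": [5000, 5200, 4300],
})

# Comparing against the clean string finds nothing.
naive_matches = (df["Year"] == "2022-23").sum()

# Strip the whitespace first, then combine the two conditions with &.
year = df["Year"].str.strip()
mask = (df["State"] == "Assam") & (year == "2022-23")
result = df[mask]

print(naive_matches)
print(result)
```

Each condition needs its own parentheses, because `&` binds more tightly than `==`.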

# Data Cleaning

- NaN : Identify & Handle
- Data Types
    - Identify
    - Alter
- Columns
    - Rename
    - Remove
    - Create
- Duplicates

## NaN (Not a Number)

In [None]:
import pandas as pd
data = pd.read_csv('./data/clean_data_crop.csv')
data.head()

In [None]:
data.sample(10)

In [None]:
data.isna()

In [None]:
data.isna().sum()

In [None]:
data.fillna(0)

In [None]:
data.head()

In [None]:
data = data.fillna(0)
data.head()
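The cells above show that `fillna()` returns a new DataFrame and leaves the original untouched unless the result is assigned back. A minimal sketch of the same behaviour on a made-up frame:

```python
import pandas as pd

nan = float("nan")
# Made-up values with three missing entries.
df = pd.DataFrame({"Rice": [10.0, nan, 30.0], "Wheat": [nan, nan, 5.0]})

missing = df.isna().sum()   # missing values per column
filled = df.fillna(0)       # new DataFrame; df itself is unchanged

print(missing)
print(df.isna().sum().sum())      # original still has its NaN values
print(filled.isna().sum().sum())  # copy has none
```

Whether 0 is a sensible replacement depends on the data; for some columns a mean, a forward fill, or dropping the row may be more appropriate.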

# Data Types

In [None]:
data.info()


- int64
- float64
- datetime64[ns]
- timedelta64[ns]
- complex128
- object
- bool


In [None]:
data['Year'].apply(pd.to_datetime)

In [None]:
data['Sugarcane'].apply(pd.to_numeric)

## Using String Methods

In [None]:
data['Sugarcane'] = data['Sugarcane'].str.replace(',','')

In [None]:
# Assign the result back; otherwise the column keeps its object dtype.
data['Sugarcane'] = pd.to_numeric(data['Sugarcane'])
data.info()
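A self-contained sketch of the same clean-then-convert idea, on a made-up Series. The `errors="coerce"` option is an addition not used in the cells above: it turns values that cannot be parsed into NaN instead of raising an error.

```python
import pandas as pd

# Made-up raw values: thousands separators and one unparseable entry.
raw = pd.Series(["1,200", "350", "n/a"], name="Sugarcane")

# Remove the commas as plain text (no regular expression needed).
cleaned = raw.str.replace(",", "", regex=False)

# Convert; anything that still cannot be parsed becomes NaN.
numbers = pd.to_numeric(cleaned, errors="coerce")

print(numbers)
print(numbers.dtype)
```

Calling `pd.to_numeric` on the whole Series is vectorised, so there is no need to go through `.apply()`.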

## Guess: How Do We Convert the Year Column to a DateTime Object?

## Hint

- Check the values in the 'Year' column
- Clean the strings so they can be parsed as DateTime objects
- Convert to DateTime object

## Solution

In [None]:
data['Year'].str.slice(0,4)

In [None]:
data['Year'] = data['Year'].str.slice(0,4)
data['Year'] = data['Year'].apply(pd.to_datetime)
data.info()

In [None]:
data.head()
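The same steps on a made-up Series, as a sketch. Passing `format="%Y"` is an assumption on top of the solution above; it tells pandas explicitly that each string is a four-digit year.

```python
import pandas as pd

# Made-up values in the same "2022-23 " style as the Year column.
years = pd.Series(["2021-22 ", "2022-23 "])

start = years.str.slice(0, 4)                   # keep the first four characters
as_dates = pd.to_datetime(start, format="%Y")   # parse as a year

print(as_dates)
print(as_dates.dt.year.tolist())
```

Each value becomes 1 January of the starting year, which is enough for plotting production over time.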

# Columns

## Rename

In [None]:
data.head()
# data.rename(columns={'Raw Jute & Mesta':'Jute'}) 

In [None]:
data.head()

In [None]:
data = data.rename(
    columns={
        'Raw Jute & Mesta':'Jute',
        'Food-grains':'Grains',
    }
)
data.head()

## Removing Column

In [None]:
data = data.drop(columns='Jute')
data.head()

## New Columns

Combine the cultivation of `Pulses` and `Grains` into a new column, `PG`.

In [None]:
data['PG'] = data['Pulses'] + data['Grains']
data.head()

In [None]:
data = data.drop(columns='PG')
data.head()

## Duplicates

In [None]:
data.duplicated()

In [None]:
data.duplicated().sum()

In [None]:
data.duplicated('State')

In [None]:
data.duplicated('State').sum()

In [None]:
data.duplicated(['State','Year']).sum()
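`duplicated()` only flags repeated rows. To remove them, `drop_duplicates()` accepts the same column subset. A sketch on made-up rows:

```python
import pandas as pd

# Made-up rows; the Assam/2022 entry appears twice.
df = pd.DataFrame({
    "State": ["Assam", "Assam", "Assam", "Karnataka"],
    "Year": ["2021", "2022", "2022", "2022"],
    "Rice": [5000, 5200, 5200, 4300],
})

# Count rows that repeat an earlier (State, Year) combination.
n_dupes = df.duplicated(["State", "Year"]).sum()

# Keep the first occurrence of each combination and drop the rest.
deduped = df.drop_duplicates(subset=["State", "Year"], keep="first")

print(n_dupes)
print(deduped)
```

Checking `State` alone reports many duplicates simply because each state has one row per year; the combination of `State` and `Year` is the meaningful key here.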

# Grouping

## Grouping by State

In [None]:
group_data = data.groupby('State')

In [None]:
group_data.get_group('Karnataka')

## Grouping by Year

In [None]:
data.groupby('Year')

In [None]:
data.groupby('Year').get_group('2022')
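Beyond pulling out a single group with `get_group()`, a grouped object is usually followed by an aggregation. A minimal sketch on made-up values (the aggregations chosen are illustrative):

```python
import pandas as pd

# Made-up production figures.
df = pd.DataFrame({
    "State": ["Assam", "Assam", "Karnataka", "Karnataka"],
    "Year": ["2021", "2022", "2021", "2022"],
    "Rice": [5000, 5200, 4100, 4300],
})

mean_rice = df.groupby("State")["Rice"].mean()     # one value per state
total_by_year = df.groupby("Year")["Rice"].sum()   # one value per year

print(mean_rice)
print(total_by_year)
```

The group keys become the index of the result, so `mean_rice["Assam"]` looks up a single state.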

# Pivot

In [None]:
rice_data = data[['State', 'Year', 'Rice']]
rice_data.head()

In [None]:
rice_data.pivot(index="State", columns='Year')
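To make the reshaping easier to see, here is a sketch of `pivot()` on a small made-up table. Naming `values="Rice"` explicitly is an addition; it gives flat column labels instead of a two-level header.

```python
import pandas as pd

# Made-up long-format data: one row per (State, Year).
df = pd.DataFrame({
    "State": ["Assam", "Assam", "Karnataka", "Karnataka"],
    "Year": ["2021", "2022", "2021", "2022"],
    "Rice": [5000, 5200, 4100, 4300],
})

# One row per state, one column per year.
wide = df.pivot(index="State", columns="Year", values="Rice")

print(wide)
```

`pivot()` raises an error if any (index, columns) pair occurs more than once, which is one reason to check for duplicates first.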

# Visualization


- Line graph
- Subplots
- Bar graph
- Scatter Plot

In [None]:
data.sample(10)

# Using plot method

In [None]:
data.plot()

- By default, the index is used as the x axis
- Every Series with numerical data is plotted as a line graph
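Both points can be checked on a tiny made-up frame. The sketch below assumes matplotlib is installed; the `Agg` backend line is only there so it also runs as a plain script without a display, and is not needed inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; optional in a notebook
import pandas as pd

# Made-up frame with one text column and two numeric columns.
df = pd.DataFrame({
    "State": ["A", "B", "C"],   # text column: not plotted
    "Rice": [10, 20, 15],
    "Wheat": [5, 8, 12],
})

ax = df.plot()  # returns a matplotlib Axes

# One line per numeric column, drawn against the index 0, 1, 2.
labels = [line.get_label() for line in ax.get_lines()]
x_values = ax.get_lines()[0].get_xdata().tolist()

print(labels)
print(x_values)
```

Because `plot()` returns the Axes, the usual matplotlib calls such as `ax.set_ylabel(...)` can be used to refine the figure afterwards.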

Choosing `x` and `y` axis

In [None]:
data.plot(x='Year', y='Rice')

Choosing Particular State

In [None]:
data.groupby('State').get_group('Karnataka').plot(x='Year', y=['Rice'], title="Karnataka")

## Try plotting the graph for other states

In [None]:
state = ""  # fill in a state name from the data, e.g. "Assam"
data.groupby('State').get_group(state).plot(x='Year', y=['Rice'], title=state)

# Multiple Series in a Single Plot

In [None]:
data.groupby('State').get_group('Karnataka').plot(x='Year', y=['Rice', 'Cotton'], title="Karnataka")

## Altering Styles

In [None]:
data.groupby('State').get_group('Karnataka') \
    .plot(
    x='Year', 
    y=['Rice', 'Cotton'], 
    title="Karnataka",
    style=['r--','g-.'],
    grid=True
)

## Sub Plots

In [None]:
data.groupby('State').get_group('Karnataka').plot(x='Year', y=['Rice', 'Cotton'], title="Karnataka", subplots=True)

In [None]:
data.groupby('State').get_group('Karnataka').plot(x='Year', title="Karnataka")

In [None]:
data.groupby('State').get_group('Karnataka').plot(x='Year', title="Karnataka",subplots=True)

## Scatter Plot

In [None]:
data.groupby('State').get_group('Karnataka').plot(x='Year', y='Rice', title="Karnataka", kind='scatter')

## Bar Graph

- Comparing state-wise rice production for a particular year

In [None]:
data.groupby('Year').get_group('2022').plot(x='State', y=['Rice'], title="2022", kind='bar')

In [None]:
data.groupby('Year').get_group('2022').plot(x='State', y=['Rice', 'Cotton'], title="2022", kind='bar')

In [None]:
data.groupby('Year').get_group('2022').plot(x='State', y=['Rice'], title="2022", kind='barh')