# Python + Pandas for Real-World Data
## Data Wrangling Made Easy! 

## Today

- Data & Data Structure
- Why Pandas?
- Installation
- Pandas Data Structure
- Creating DataFrames
- Reading the files
- Inspecting
- Extracting Subsets
- Data Cleaning


# Installation

### Using pip

```cmd
  pip install pandas
```

### Conda packages

```shell
conda install pandas

# or, from the conda-forge channel:
conda install -c conda-forge pandas
```

### Pandas is part of major Python distributions, including:

* Anaconda
* ActiveState ActivePython
* WinPython

### Install from the GitHub repository

[pandas on GitHub](https://github.com/pandas-dev/pandas)
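Whichever install route is used, a quick check is to import the library and print its version. This is a minimal sketch; the version printed depends on your environment.

```python
import pandas as pd

# If the import succeeds, pandas is installed; the version string
# shows which release the current environment picked up.
print(pd.__version__)
```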

# Pandas Data Structure

- Series: a one-dimensional labeled array

- DataFrame: a two-dimensional labeled table whose columns are Series
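As a rough illustration of how the two structures relate, the sketch below builds one of each from made-up values (the names `s` and `df` and the numbers are hypothetical) and checks their dimensions.

```python
import pandas as pd

# A Series: one column of values plus an index of labels.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame: several columns that share one row index.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

print(s.ndim, df.ndim)         # number of dimensions of each object
print(type(df["x"]).__name__)  # selecting one column returns a Series
```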

# Creating Pandas Data Structures

## Series

A Series can be created from a:

- list
- tuple
- dictionary

In [None]:
import pandas as pd

In [None]:
even = [2, 4, 6, 8, 10, 12]

series_data = pd.Series(even)
series_data

In [None]:
series_data.name = "even"

In [None]:
series_data.info()

In [None]:
student = {'name':"Kishore", 'age':20, 'place':"Bengaluru", 'email':"kishore@email.com"}
student_data = pd.Series(student)
student_data

In [None]:
student_data.info()
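The cells above cover a list and a dictionary. A tuple works much like a list, as this small sketch suggests (the `odd` values and the letter labels are made up for illustration):

```python
import pandas as pd

# Hypothetical data: the first few odd numbers, stored in a tuple.
odd = (1, 3, 5, 7, 9)

# A tuple is treated like a list: values get the default 0..n-1 index.
odd_series = pd.Series(odd, name="odd")

# A custom index can be supplied instead of the default one.
labelled = pd.Series(odd, index=["a", "b", "c", "d", "e"], name="odd")

print(odd_series.index.tolist())
print(labelled["c"])
```

With a dictionary, the keys become the index; with a list or tuple, the index has to be passed explicitly if the default integers are not wanted.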

# DataFrame

In [None]:
student = {
    "name": ['Kishore', 'Vinay', 'Adithya', 'Kumar'],
    "place": ['Bengaluru', 'Mysore', 'Bengaluru', 'Tumakuru']
}

student_df = pd.DataFrame(student)
student_df

In [None]:
student_df.index

In [None]:
student_df.columns

In [None]:
student_df.place.unique()
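A dictionary of lists is one way to build a DataFrame. Another common shape is a list of row dictionaries, sketched here with two of the same students (the variable names `records` and `records_df` are illustrative only):

```python
import pandas as pd

# Each dictionary is one row; the keys become the column names.
records = [
    {"name": "Kishore", "place": "Bengaluru"},
    {"name": "Vinay", "place": "Mysore"},
]

records_df = pd.DataFrame(records)

print(records_df.shape)
print(records_df.columns.tolist())
```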

# Reading the files

In [None]:
data = pd.read_csv('./data/clean_data_crop.csv')

In [None]:
data_excel = pd.read_excel('./data/State-Wise_Production_of_Foodgrains_and_Major_Non-Foodgrain_Crops.xlsx')
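The two cells above depend on files in a local `./data` folder. To try `read_csv` without those files, the sketch below reads from an in-memory string instead. The CSV content is invented for illustration and is not taken from the real dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """State,Year,Rice,Wheat
Karnataka,2019,3600,180
Punjab,2019,11800,17600
Karnataka,2020,3900,190
"""

# read_csv accepts a path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

# usecols limits which columns are loaded.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["State", "Rice"])

print(sample.shape)
print(subset.columns.tolist())
```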

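## Installing Optional Extras

`read_excel` needs an Excel engine such as `openpyxl`, which pandas does not install by default. The extras below pull in these optional dependencies (quote the argument in shells such as zsh that treat square brackets specially):

- `pip install "pandas[excel]"`
- `pip install "pandas[all]"`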

# Inspecting


Check what the data looks like, how many rows and columns there are, and how much data we have.

### How many rows and columns are there?

In [None]:
data.shape

### Column names

In [None]:
data.columns

### Data Types

In [None]:
data.dtypes

### What does the data look like?

In [None]:
data.head()

In [None]:
data.tail()

### Information on DataFrames

In [None]:
data.info()
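The inspection calls above also take a few useful arguments. The sketch below uses a small made-up frame (the `df` name and values are hypothetical) to unpack `shape`, limit `head()` to a given number of rows, and estimate memory use.

```python
import pandas as pd

# Hypothetical stand-in for the crop data.
df = pd.DataFrame({
    "State": ["Karnataka", "Punjab", "Kerala"],
    "Rice": [3600, 11800, 600],
})

# shape is a (rows, columns) tuple, so it can be unpacked.
n_rows, n_cols = df.shape

# head(n) and tail(n) return the first or last n rows (default 5).
first_two = df.head(2)

# memory_usage(deep=True) also counts the contents of string columns.
total_bytes = df.memory_usage(deep=True).sum()

print(n_rows, n_cols)
print(first_two)
print(total_bytes)
```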

# Extracting Subsets

- Selecting columns
- Selecting rows
- Indexing
- Filtering

## Selecting Columns

In [None]:
data.State

In [None]:
data['State']

## Multiple selection

Create another DataFrame containing only the *State*, *Year*, and *Rice* columns.

In [None]:
data[['State','Year', 'Rice']]

## Selecting Rows

Select the rows at positions 11 to 20 (counting from 0). A slice excludes its end, so the stop value is 21.

In [None]:
data[11:21]

## Indexing

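- `iloc[]` selects rows and columns by integer position; the end of a slice is excluded
- `loc[]` selects rows and columns by label; both ends of a slice are included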

In [None]:
data.iloc[10:21]

In [None]:
data.iloc[30:35, 0:4]

In [None]:
data.loc[30:35, 'State':'Wheat']
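The difference in how `iloc[]` and `loc[]` treat the end of a slice is easy to miss. This sketch uses a small made-up frame with the default integer index (the `df` name and values are hypothetical) to show that the same numbers select different rows.

```python
import pandas as pd

# Hypothetical frame with the default 0..4 index.
df = pd.DataFrame({
    "State": ["A", "B", "C", "D", "E"],
    "Rice": [1, 2, 3, 4, 5],
    "Wheat": [10, 20, 30, 40, 50],
})

# iloc: positions 1 and 2 (stop excluded), columns at positions 0 and 1.
by_position = df.iloc[1:3, 0:2]

# loc: labels 1, 2 and 3 (stop included), columns State through Rice.
by_label = df.loc[1:3, "State":"Rice"]

print(by_position.shape)
print(by_label.shape)
```

Here the two calls agree on the columns but differ by one row, because `loc[]` includes the row labeled 3.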

## Filtering

Using a **Boolean mask**

In [None]:
data['State'] == 'Karnataka'

In [None]:
data[data['State'] == 'Karnataka']

In [None]:
data[data['Rice']>15000][['State', 'Year', 'Rice']]
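Masks can also be combined. The sketch below works on a made-up frame (all values are hypothetical) and uses `&` for "and", `isin()` for membership in a list, and `loc[]` to filter rows and pick columns in one step. Each condition needs its own parentheses, because `&` binds more tightly than `==`.

```python
import pandas as pd

# Hypothetical stand-in for the crop data.
df = pd.DataFrame({
    "State": ["Karnataka", "Punjab", "Karnataka", "Kerala"],
    "Year": [2019, 2019, 2020, 2020],
    "Rice": [3600, 11800, 3900, 600],
})

# Both conditions must hold.
both = df[(df["State"] == "Karnataka") & (df["Year"] == 2020)]

# Rows whose State is any of the listed values.
either = df[df["State"].isin(["Punjab", "Kerala"])]

# Filter rows and select columns in a single loc call.
via_loc = df.loc[df["Rice"] > 3000, ["State", "Rice"]]

print(both)
print(either)
print(via_loc)
```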

# Summary Statistics

- `value_counts()`
- `sum()`
- `min()`
- `max()`
- `unique()`, `nunique()`
- `describe()`

In [None]:
data.State.value_counts()

In [None]:
data.describe()

In [None]:
data.describe(include='all')
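The remaining functions in the list above work on a single column in the same way. This is a minimal sketch on made-up numbers, not the real crop figures:

```python
import pandas as pd

# Hypothetical stand-in for the crop data.
df = pd.DataFrame({
    "State": ["Karnataka", "Punjab", "Karnataka", "Kerala"],
    "Rice": [3600, 11800, 3900, 600],
})

total_rice = df["Rice"].sum()      # sum of the column
lowest = df["Rice"].min()          # smallest value
highest = df["Rice"].max()         # largest value
states = df["State"].unique()      # array of distinct values
n_states = df["State"].nunique()   # count of distinct values

print(total_rice, lowest, highest)
print(states, n_states)
```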

# Data Cleaning

- Removing columns
- **Alter data type of column**
- Creating new column

In [None]:
data = data.drop(columns='Oilseeds')
data.head()

In [None]:
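# This conversion raises a ValueError if the column holds strings
# with thousands separators such as "1,234".
data['Sugarcane'] = data['Sugarcane'].apply(pd.to_numeric)
data.info()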

In [None]:
# Strip the thousands separators first, then convert the column to numbers.
data['Sugarcane'] = data['Sugarcane'].str.replace(',', '')
data['Sugarcane'] = pd.to_numeric(data['Sugarcane'])
data.info()
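"Creating new column" from the list at the top of this section is not shown in the cells above. The sketch below adds one on a made-up frame (the `Total` column and all values are hypothetical). It also uses `errors="coerce"`, an option of `pd.to_numeric` that turns unparseable entries into `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical data: Sugarcane is stored as text, with one bad entry.
df = pd.DataFrame({
    "State": ["Karnataka", "Punjab", "Kerala"],
    "Rice": [3600, 11800, 600],
    "Sugarcane": ["38,000", "7,700", "n/a"],
})

# Remove the thousands separators, then convert; "n/a" becomes NaN.
cleaned = df["Sugarcane"].str.replace(",", "", regex=False)
df["Sugarcane"] = pd.to_numeric(cleaned, errors="coerce")

# Assigning to a new column name creates the column.
df["Total"] = df["Rice"] + df["Sugarcane"]

print(df.dtypes)
print(df)
```

Any row with a missing `Sugarcane` value also gets a missing `Total`, since arithmetic with `NaN` yields `NaN`.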