Pandas is a powerful Python library for data manipulation and analysis. It provides data structures like DataFrame and Series, making it easy to handle and analyze structured data. With functions for cleaning, filtering, and transforming data, Pandas is widely used in data science and analysis tasks.

# Why do we use pandas?


*   Data structures (Series and DataFrame)
*   Ease of data cleaning
*   Data Exploration
*   Handling different data types



In [6]:
# run pip install pandas if you can't import pandas
import pandas as pd

A DataFrame is a two-dimensional, tabular data structure in the Pandas library, similar to a spreadsheet or SQL table, with labeled rows and columns. Each column can hold a different data type (e.g., integers, floats, strings), and you can filter, group, and merge DataFrames to analyze and manipulate structured data efficiently.

In [7]:
# Creating a DataFrame from a dictionary: keys become column names, lists become column values
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [28, 24, 22],
        'City': ['New York', 'Los Angeles', 'Chicago']}

df = pd.DataFrame(data)
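
To make the point about per-column data types concrete, here is a minimal sketch that rebuilds the same illustrative table (under the assumed name `people`) and inspects one column and the column types. The variable names are chosen for this example only.

```python
import pandas as pd

# Same illustrative data as the cell above
people = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob'],
    'Age': [28, 24, 22],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# Selecting a single column gives a Series
age = people['Age']
print(type(age).__name__)

# Each column carries its own dtype: Age is an integer type,
# while Name and City hold strings
print(people.dtypes)

# shape is (rows, columns)
print(people.shape)
```

A DataFrame can be thought of as a set of Series that share one row index, which is why each column can have a different type.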

Loading and saving data

In [12]:
# Reading from a CSV file
data = pd.read_csv('matches - matches.csv')
data

Unnamed: 0,id,season,city,date,team1,team2,toss_winner,toss_decision,result,dl_applied,winner,win_by_runs,win_by_wickets,player_of_match,venue,umpire1,umpire2,umpire3
0,1,2017,Hyderabad,2017-04-05,Sunrisers Hyderabad,Royal Challengers Bangalore,Royal Challengers Bangalore,field,normal,0,Sunrisers Hyderabad,35,0,Yuvraj Singh,"Rajiv Gandhi International Stadium, Uppal",AY Dandekar,NJ Llong,
1,2,2017,Pune,2017-04-06,Mumbai Indians,Rising Pune Supergiant,Rising Pune Supergiant,field,normal,0,Rising Pune Supergiant,0,7,SPD Smith,Maharashtra Cricket Association Stadium,A Nand Kishore,S Ravi,
2,3,2017,Rajkot,2017-04-07,Gujarat Lions,Kolkata Knight Riders,Kolkata Knight Riders,field,normal,0,Kolkata Knight Riders,0,10,CA Lynn,Saurashtra Cricket Association Stadium,Nitin Menon,CK Nandan,
3,4,2017,Indore,2017-04-08,Rising Pune Supergiant,Kings XI Punjab,Kings XI Punjab,field,normal,0,Kings XI Punjab,0,6,GJ Maxwell,Holkar Cricket Stadium,AK Chaudhary,C Shamshuddin,
4,5,2017,Bangalore,2017-04-08,Royal Challengers Bangalore,Delhi Daredevils,Royal Challengers Bangalore,bat,normal,0,Royal Challengers Bangalore,15,0,KM Jadhav,M Chinnaswamy Stadium,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
631,632,2016,Raipur,2016-05-22,Delhi Daredevils,Royal Challengers Bangalore,Royal Challengers Bangalore,field,normal,0,Royal Challengers Bangalore,0,6,V Kohli,Shaheed Veer Narayan Singh International Stadium,A Nand Kishore,BNJ Oxenford,
632,633,2016,Bangalore,2016-05-24,Gujarat Lions,Royal Challengers Bangalore,Royal Challengers Bangalore,field,normal,0,Royal Challengers Bangalore,0,4,AB de Villiers,M Chinnaswamy Stadium,AK Chaudhary,HDPK Dharmasena,
633,634,2016,Delhi,2016-05-25,Sunrisers Hyderabad,Kolkata Knight Riders,Kolkata Knight Riders,field,normal,0,Sunrisers Hyderabad,22,0,MC Henriques,Feroz Shah Kotla,M Erasmus,C Shamshuddin,
634,635,2016,Delhi,2016-05-27,Gujarat Lions,Sunrisers Hyderabad,Sunrisers Hyderabad,field,normal,0,Sunrisers Hyderabad,0,4,DA Warner,Feroz Shah Kotla,M Erasmus,CK Nandan,
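
The cell above only covers loading. As a rough sketch of the saving half, the example below writes a small made-up DataFrame to a temporary CSV file and reads it back. The file name `scores.csv` and the data are assumptions for illustration, not part of the matches dataset.

```python
import os
import tempfile

import pandas as pd

# Small made-up table, used only for this example
scores = pd.DataFrame({'team': ['A', 'B'], 'runs': [150, 142]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'scores.csv')  # hypothetical file name

    # index=False keeps the row index out of the file, which avoids
    # an extra unnamed column when the file is read back
    scores.to_csv(path, index=False)

    reloaded = pd.read_csv(path)

print(reloaded)
```

Writing without `index=False` stores the row index as the first column of the file, so it reappears as an unnamed column on the next `read_csv`.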


In Pandas, you can inspect your data using various methods. Some common ones include:

1. **head():** Displays the first few rows of the DataFrame.
   ```python
   df.head()
   ```

2. **tail():** Shows the last few rows of the DataFrame.
   ```python
   df.tail()
   ```

3. **info():** Prints a summary of the DataFrame, including column names, data types, and the number of non-null values per column.
   ```python
   df.info()
   ```

4. **describe():** Generates descriptive statistics, like mean and standard deviation, for numerical columns.
   ```python
   df.describe()
   ```

5. **shape:** An attribute (used without parentheses) that holds the number of rows and columns as a tuple.
   ```python
   df.shape
   ```

These functions help you get a quick overview of your dataset and understand its structure and content.

Data Selection and Indexing

In [13]:
data.head()

Unnamed: 0,id,season,city,date,team1,team2,toss_winner,toss_decision,result,dl_applied,winner,win_by_runs,win_by_wickets,player_of_match,venue,umpire1,umpire2,umpire3
0,1,2017,Hyderabad,2017-04-05,Sunrisers Hyderabad,Royal Challengers Bangalore,Royal Challengers Bangalore,field,normal,0,Sunrisers Hyderabad,35,0,Yuvraj Singh,"Rajiv Gandhi International Stadium, Uppal",AY Dandekar,NJ Llong,
1,2,2017,Pune,2017-04-06,Mumbai Indians,Rising Pune Supergiant,Rising Pune Supergiant,field,normal,0,Rising Pune Supergiant,0,7,SPD Smith,Maharashtra Cricket Association Stadium,A Nand Kishore,S Ravi,
2,3,2017,Rajkot,2017-04-07,Gujarat Lions,Kolkata Knight Riders,Kolkata Knight Riders,field,normal,0,Kolkata Knight Riders,0,10,CA Lynn,Saurashtra Cricket Association Stadium,Nitin Menon,CK Nandan,
3,4,2017,Indore,2017-04-08,Rising Pune Supergiant,Kings XI Punjab,Kings XI Punjab,field,normal,0,Kings XI Punjab,0,6,GJ Maxwell,Holkar Cricket Stadium,AK Chaudhary,C Shamshuddin,
4,5,2017,Bangalore,2017-04-08,Royal Challengers Bangalore,Delhi Daredevils,Royal Challengers Bangalore,bat,normal,0,Royal Challengers Bangalore,15,0,KM Jadhav,M Chinnaswamy Stadium,,,


In [14]:
data.head(3)

Unnamed: 0,id,season,city,date,team1,team2,toss_winner,toss_decision,result,dl_applied,winner,win_by_runs,win_by_wickets,player_of_match,venue,umpire1,umpire2,umpire3
0,1,2017,Hyderabad,2017-04-05,Sunrisers Hyderabad,Royal Challengers Bangalore,Royal Challengers Bangalore,field,normal,0,Sunrisers Hyderabad,35,0,Yuvraj Singh,"Rajiv Gandhi International Stadium, Uppal",AY Dandekar,NJ Llong,
1,2,2017,Pune,2017-04-06,Mumbai Indians,Rising Pune Supergiant,Rising Pune Supergiant,field,normal,0,Rising Pune Supergiant,0,7,SPD Smith,Maharashtra Cricket Association Stadium,A Nand Kishore,S Ravi,
2,3,2017,Rajkot,2017-04-07,Gujarat Lions,Kolkata Knight Riders,Kolkata Knight Riders,field,normal,0,Kolkata Knight Riders,0,10,CA Lynn,Saurashtra Cricket Association Stadium,Nitin Menon,CK Nandan,


In [15]:
data.tail(3)

Unnamed: 0,id,season,city,date,team1,team2,toss_winner,toss_decision,result,dl_applied,winner,win_by_runs,win_by_wickets,player_of_match,venue,umpire1,umpire2,umpire3
633,634,2016,Delhi,2016-05-25,Sunrisers Hyderabad,Kolkata Knight Riders,Kolkata Knight Riders,field,normal,0,Sunrisers Hyderabad,22,0,MC Henriques,Feroz Shah Kotla,M Erasmus,C Shamshuddin,
634,635,2016,Delhi,2016-05-27,Gujarat Lions,Sunrisers Hyderabad,Sunrisers Hyderabad,field,normal,0,Sunrisers Hyderabad,0,4,DA Warner,Feroz Shah Kotla,M Erasmus,CK Nandan,
635,636,2016,Bangalore,2016-05-29,Sunrisers Hyderabad,Royal Challengers Bangalore,Sunrisers Hyderabad,bat,normal,0,Sunrisers Hyderabad,8,0,BCJ Cutting,M Chinnaswamy Stadium,HDPK Dharmasena,BNJ Oxenford,


In [16]:
data.shape

(636, 18)

In [17]:
# info is a method, so it needs parentheses; a bare `data.info` only
# echoes the bound method instead of printing the summary
data.info()

In [18]:
data.describe()

Unnamed: 0,id,season,dl_applied,win_by_runs,win_by_wickets,umpire3
count,636.0,636.0,636.0,636.0,636.0,0.0
mean,318.5,2012.490566,0.025157,13.68239,3.372642,
std,183.741666,2.773026,0.156726,23.908877,3.420338,
min,1.0,2008.0,0.0,0.0,0.0,
25%,159.75,2010.0,0.0,0.0,0.0,
50%,318.5,2012.0,0.0,0.0,4.0,
75%,477.25,2015.0,0.0,20.0,7.0,
max,636.0,2017.0,1.0,146.0,10.0,
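
The summary above lists only numeric columns, which is the default for `describe()`. The sketch below, on a tiny made-up table (the team names are placeholders), shows how `include='all'` brings text columns into the summary as well.

```python
import pandas as pd

# Made-up rows, loosely shaped like the matches data
matches = pd.DataFrame({
    'season': [2016, 2016, 2017],
    'winner': ['Team A', 'Team B', 'Team A'],
})

# Default: numeric columns only
numeric_summary = matches.describe()

# include='all' adds text columns, reporting count, unique, top and freq
full_summary = matches.describe(include='all')

print(numeric_summary)
print(full_summary)
```

Statistics that do not apply to a column, such as the mean of a text column, appear as `NaN` in the combined summary.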


In [19]:
data['winner']

0              Sunrisers Hyderabad
1           Rising Pune Supergiant
2            Kolkata Knight Riders
3                  Kings XI Punjab
4      Royal Challengers Bangalore
                  ...             
631    Royal Challengers Bangalore
632    Royal Challengers Bangalore
633            Sunrisers Hyderabad
634            Sunrisers Hyderabad
635            Sunrisers Hyderabad
Name: winner, Length: 636, dtype: object

In [22]:
data['winner'].shape

(636,)

In [20]:
data[['team1','team2','winner']]

Unnamed: 0,team1,team2,winner
0,Sunrisers Hyderabad,Royal Challengers Bangalore,Sunrisers Hyderabad
1,Mumbai Indians,Rising Pune Supergiant,Rising Pune Supergiant
2,Gujarat Lions,Kolkata Knight Riders,Kolkata Knight Riders
3,Rising Pune Supergiant,Kings XI Punjab,Kings XI Punjab
4,Royal Challengers Bangalore,Delhi Daredevils,Royal Challengers Bangalore
...,...,...,...
631,Delhi Daredevils,Royal Challengers Bangalore,Royal Challengers Bangalore
632,Gujarat Lions,Royal Challengers Bangalore,Royal Challengers Bangalore
633,Sunrisers Hyderabad,Kolkata Knight Riders,Sunrisers Hyderabad
634,Gujarat Lions,Sunrisers Hyderabad,Sunrisers Hyderabad


In [21]:
data[['team1','team2','winner']].shape

(636, 3)
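
The cells above select whole columns. As a small sketch of row-level selection, the example below uses the illustrative people table from earlier (rebuilt here as `people`) to show position-based, label-based, and condition-based indexing.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob'],
    'Age': [28, 24, 22],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# Single brackets return a Series, double brackets a DataFrame
one_column = people['Age']               # shape (3,)
two_columns = people[['Name', 'Age']]    # shape (3, 2)

# iloc selects by integer position, loc by index label and column name
first_row = people.iloc[0]
alice_city = people.loc[1, 'City']

# A boolean mask keeps only the rows where the condition is True
over_23 = people[people['Age'] > 23]

print(alice_city)
print(over_23)
```

With the default integer index, `loc` and `iloc` look alike for rows, but `loc` follows the labels if the index is later changed or filtered.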

Data Cleaning

In [None]:
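# Handling missing values: dropna() returns a new DataFrame,
# so assign the result to keep it
df = df.dropna()

# Removing duplicate rows (also returns a new DataFrame)
df = df.drop_duplicates()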
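
The DataFrame built earlier has no gaps or repeats, so the calls above leave it unchanged. The following sketch uses a made-up table (named `raw` here) with one missing age and one repeated row to show what each cleaning step does.

```python
import numpy as np
import pandas as pd

# Made-up data: Bob's age is missing and Alice appears twice
raw = pd.DataFrame({
    'Name': ['John', 'Alice', 'Alice', 'Bob'],
    'Age': [28, 24, 24, np.nan],
})

# Count missing values per column before deciding what to do
print(raw.isna().sum())

# Option 1: drop rows that contain any missing value (removes Bob)
no_missing = raw.dropna()

# Option 2: fill the gap instead, here with the column mean
filled = raw.fillna({'Age': raw['Age'].mean()})

# Drop fully identical rows (removes the second Alice row)
deduplicated = raw.drop_duplicates()
```

All three calls return new DataFrames and leave `raw` untouched, which makes it easy to compare the options before committing to one.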

Data Manipulation with pandas

In [None]:
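# Merging DataFrames (df1 and df2 are placeholders for two
# DataFrames that share a column named 'common_column')
merged_df = pd.merge(df1, df2, on='common_column')

# Grouping and aggregating: numeric_only=True skips text columns
# such as 'Name', for which a mean is not defined
grouped_df = df.groupby('City').mean(numeric_only=True)

# Applying a function to every value in a column
df['Age'] = df['Age'].apply(lambda x: x + 1)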
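
Since `df1` and `df2` are not defined in this notebook, here is a self-contained sketch of the same three operations. The `regions` table and the extra row for Carol are made up for the example.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob', 'Carol'],
    'Age': [28, 24, 22, 30],
    'City': ['New York', 'Chicago', 'Chicago', 'Boston'],
})

# Hypothetical lookup table; Boston is deliberately missing
regions = pd.DataFrame({
    'City': ['New York', 'Chicago'],
    'Region': ['East', 'Midwest'],
})

# The default is an inner join, so Carol (Boston) is dropped
merged = pd.merge(people, regions, on='City')

# how='left' keeps every row of people; Carol gets a missing Region
left_merged = pd.merge(people, regions, on='City', how='left')

# Mean age per city, computed on the numeric column only
mean_age = people.groupby('City')['Age'].mean()

# For simple arithmetic, a vectorized expression does the same
# job as apply with a lambda
people['Age'] = people['Age'] + 1
```

The choice of `how` decides what happens to rows without a match, so it is worth checking the row count after a merge.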

Data Visualization with pandas

In [None]:
# Plotting data (pandas uses matplotlib for plotting, so it must be installed)
df.plot(kind='bar', x='Name', y='Age')

Understanding Categorical and Numerical Data

Pandas provides various methods to handle both categorical and numerical data:

**Handling Numerical Data:**
1. **Descriptive Statistics:** Use `describe()` to get statistical summary.
   ```python
   df.describe()
   ```

2. **Math Operations:** Perform mathematical operations on numerical columns.
   ```python
   df['numerical_column'].mean()
   ```

**Handling Categorical Data:**
1. **Value Counts:** Check the distribution of categorical values.
   ```python
   df['categorical_column'].value_counts()
   ```

2. **Label Encoding:** Convert categorical values to numerical labels.
   ```python
   from sklearn.preprocessing import LabelEncoder
   le = LabelEncoder()
   df['encoded_column'] = le.fit_transform(df['categorical_column'])
   ```

3. **One-Hot Encoding:** Create binary columns for each category (useful for machine learning models).
   ```python
   df_encoded = pd.get_dummies(df, columns=['categorical_column'])
   ```

These methods allow you to analyze, preprocess, and prepare both numerical and categorical data for various data science tasks.
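
The snippets above use placeholder column names. The sketch below runs the categorical steps on a small made-up `City` column. For label encoding it uses pandas category codes rather than scikit-learn, which is a substitution made here to keep the example dependency-free.

```python
import pandas as pd

cities = pd.DataFrame({'City': ['Chicago', 'New York', 'Chicago']})

# How often each category occurs
counts = cities['City'].value_counts()

# Label encoding with pandas only: categories are sorted,
# so Chicago becomes 0 and New York becomes 1
codes = cities['City'].astype('category').cat.codes

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(cities, columns=['City'])

print(counts)
print(codes.tolist())
print(one_hot.columns.tolist())
```

Integer labels imply an order between categories, so one-hot encoding is often the safer choice for models that treat inputs as numeric quantities.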

Reshaping Data

In [None]:
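# Melting DataFrames: wide to long, one row per (Name, Attribute) pair
df_melted = pd.melt(df, id_vars=['Name'], var_name='Attribute', value_name='Value')

# Pivoting DataFrames: long back to wide. Pivot the melted frame,
# since the original df has no 'Attribute' or 'Value' columns
df_pivoted = df_melted.pivot(index='Name', columns='Attribute', values='Value')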
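
As a quick check of how the two operations relate, this sketch melts a two-row version of the illustrative table and pivots it back. The names `long_form` and `wide_form` are chosen for the example.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['John', 'Alice'],
    'Age': [28, 24],
    'City': ['New York', 'Los Angeles'],
})

# Wide to long: 2 names x 2 attributes gives 4 rows
long_form = pd.melt(people, id_vars=['Name'],
                    var_name='Attribute', value_name='Value')

# Long to wide: Name becomes the index, each Attribute a column
wide_form = long_form.pivot(index='Name', columns='Attribute', values='Value')

print(long_form)
print(wide_form)
```

`pivot` requires each (index, columns) pair to be unique; for data with repeated pairs, `pivot_table` with an aggregation function is the usual alternative.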