Data manipulation is a crucial skill for data scientists and analysts. Python’s Pandas library provides a powerful toolkit for efficiently working with structured data. In this blog, we will dive deep into Pandas and explore its functionalities for data cleaning, transformation, and analysis. By the end, you’ll have a solid understanding of how to leverage Pandas to manipulate data effectively. Let’s get started!

Introduction to Pandas:
Pandas is a fast, powerful, and flexible open-source data manipulation library. It introduces two primary data structures: Series and DataFrame. Let’s create a simple DataFrame using Pandas:

In [5]:
import pandas as pd

data = {'Name': ['John', 'Emma', 'Michael'],
        'Age': [25, 28, 31],
        'City': ['New York', 'Los Angeles', 'Chicago']}

df = pd.DataFrame(data)
print(df)


      Name  Age         City
0     John   25     New York
1     Emma   28  Los Angeles
2  Michael   31      Chicago
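
The other core structure mentioned above is the Series, a one-dimensional labeled array; each column of a DataFrame is one. Here is a minimal sketch that reuses the same toy values (the variable names `ages` and `age_column` are illustrative only):

```python
import pandas as pd

# A Series is a one-dimensional array of values with an index of labels
ages = pd.Series([25, 28, 31], index=['John', 'Emma', 'Michael'], name='Age')

# Look up a value by its label, or compute over the whole Series at once
emma_age = ages['Emma']
mean_age = ages.mean()

# Selecting a single column from a DataFrame also returns a Series
df = pd.DataFrame({'Name': ['John', 'Emma', 'Michael'], 'Age': [25, 28, 31]})
age_column = df['Age']

print(emma_age, mean_age, type(age_column).__name__)
```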


Reading and Writing Data:
Pandas provides various functions to read data from different file formats. Let’s read a CSV file into a DataFrame:

In [8]:
df = pd.read_csv('data.csv')
print(df.head(1))

# To write data back to a CSV file:

df.to_csv('new_data.csv', index=False)

    stop_date stop_time  county_name driver_gender  driver_age_raw  \
0  2005-01-02     01:55          NaN             M          1985.0   

   driver_age driver_race violation_raw violation  search_conducted  \
0        20.0       White      Speeding  Speeding             False   

  search_type stop_outcome is_arrested stop_duration  drugs_related_stop  
0         NaN     Citation       False      0-15 Min               False  
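
`read_csv` accepts many optional parameters. The sketch below shows two common ones, `usecols` and `parse_dates`, and then checks that a write followed by a read gives back the same data. It uses an in-memory buffer with made-up rows in place of a real file, so `csv_text` and `stops` are assumptions for illustration:

```python
import io
import pandas as pd

# Illustrative in-memory CSV standing in for a file on disk
csv_text = "stop_date,driver_age,violation\n2005-01-02,20,Speeding\n2005-01-18,40,Speeding\n"

# usecols loads only the listed columns;
# parse_dates converts the column to a datetime type while reading
stops = pd.read_csv(io.StringIO(csv_text),
                    usecols=['stop_date', 'driver_age'],
                    parse_dates=['stop_date'])

# Write to a buffer; index=False leaves the row labels out of the output
buffer = io.StringIO()
stops.to_csv(buffer, index=False)

# Reading the written text back should reproduce the same DataFrame
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()), parse_dates=['stop_date'])

print(stops.dtypes)
print(round_trip.equals(stops))
```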


### Data Cleaning and Preprocessing:
Cleaning and preprocessing data is essential before analysis. Let’s handle missing values and remove duplicates from a DataFrame:

In [9]:
# Handling missing values
# Each call returns a new DataFrame and leaves df itself unchanged
df.dropna()   # Drop rows that contain any missing value
df.fillna(0)  # Fill missing values with 0

# Removing duplicate rows
df.drop_duplicates()

Unnamed: 0,stop_date,stop_time,county_name,driver_gender,driver_age_raw,driver_age,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop
0,2005-01-02,01:55,,M,1985.0,20.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
1,2005-01-18,08:15,,M,1965.0,40.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
2,2005-01-23,23:15,,M,1972.0,33.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
3,2005-02-20,17:15,,M,1986.0,19.0,White,Call for Service,Other,False,,Arrest Driver,True,16-30 Min,False
4,2005-03-14,10:00,,F,1984.0,21.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91736,2015-12-31,20:27,,M,1986.0,29.0,White,Speeding,Speeding,False,,Warning,False,0-15 Min,False
91737,2015-12-31,20:35,,F,1982.0,33.0,White,Equipment/Inspection Violation,Equipment,False,,Warning,False,0-15 Min,False
91738,2015-12-31,20:45,,M,1992.0,23.0,White,Other Traffic Violation,Moving violation,False,,Warning,False,0-15 Min,False
91739,2015-12-31,21:42,,M,1993.0,22.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
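
Note that `dropna`, `fillna`, and `drop_duplicates` return new DataFrames, so the calls above leave `df` unchanged unless the result is assigned. The following sketch chains them on a small made-up frame (the `raw` data is an assumption, loosely modeled on the columns above) and keeps the result:

```python
import numpy as np
import pandas as pd

# Small illustrative frame: one all-empty column, one missing age, one duplicate row
raw = pd.DataFrame({
    'county_name': [np.nan, np.nan, np.nan, np.nan],
    'driver_age': [20.0, np.nan, 33.0, 33.0],
    'violation': ['Speeding', 'Speeding', 'Other', 'Other'],
})

# Each method returns a new DataFrame, so assign the result to keep it
cleaned = (
    raw.dropna(axis='columns', how='all')   # drop columns that are entirely missing
       .dropna(subset=['driver_age'])       # drop rows missing a required value
       .drop_duplicates()                   # drop repeated rows
       .reset_index(drop=True)
)

print(cleaned)
print(raw.shape, cleaned.shape)
```

Dropping the all-empty column first matters here: a plain `dropna()` would remove every row, because `county_name` is missing everywhere.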


In [12]:
df.columns

Index(['stop_date', 'stop_time', 'county_name', 'driver_gender',
       'driver_age_raw', 'driver_age', 'driver_race', 'violation_raw',
       'violation', 'search_conducted', 'search_type', 'stop_outcome',
       'is_arrested', 'stop_duration', 'drugs_related_stop'],
      dtype='object')

#### Data Transformation:
Pandas provides powerful functions for transforming data. Let’s filter rows, select columns, and sort data in a DataFrame:

In [13]:
# Filtering rows
df[df['driver_age'] > 25]


Unnamed: 0,stop_date,stop_time,county_name,driver_gender,driver_age_raw,driver_age,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop
1,2005-01-18,08:15,,M,1965.0,40.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
2,2005-01-23,23:15,,M,1972.0,33.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
6,2005-04-01,17:30,,M,1969.0,36.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
8,2005-07-13,10:15,,M,1970.0,35.0,Black,Speeding,Speeding,False,,Citation,False,0-15 Min,False
9,2005-07-13,15:45,,M,1970.0,35.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91732,2015-12-31,19:44,,F,1969.0,46.0,White,Speeding,Speeding,False,,Warning,False,0-15 Min,False
91733,2015-12-31,19:55,,F,1974.0,41.0,White,Registration Violation,Registration/plates,False,,Citation,False,0-15 Min,False
91736,2015-12-31,20:27,,M,1986.0,29.0,White,Speeding,Speeding,False,,Warning,False,0-15 Min,False
91737,2015-12-31,20:35,,F,1982.0,33.0,White,Equipment/Inspection Violation,Equipment,False,,Warning,False,0-15 Min,False


In [14]:

# Selecting columns
df[['driver_age', 'county_name']]


       driver_age  county_name
0            20.0          NaN
1            40.0          NaN
2            33.0          NaN
3            19.0          NaN
4            21.0          NaN
...           ...          ...
91736        29.0          NaN
91737        33.0          NaN
91738        23.0          NaN
91739        22.0          NaN


In [15]:

# Sorting data
df.sort_values('driver_age', ascending=False)

Unnamed: 0,stop_date,stop_time,county_name,driver_gender,driver_age_raw,driver_age,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop
71539,2013-09-08,09:50,,F,1914.0,99.0,White,Other Traffic Violation,Moving violation,False,,Citation,False,16-30 Min,False
75037,2014-02-14,10:21,,M,1920.0,94.0,White,Other Traffic Violation,Moving violation,False,,Citation,False,0-15 Min,False
76956,2014-04-25,16:40,,M,1924.0,90.0,White,Speeding,Speeding,False,,Warning,False,0-15 Min,False
74348,2014-01-18,13:07,,F,1925.0,89.0,White,Equipment/Inspection Violation,Equipment,False,,N/D,False,0-15 Min,False
9829,2006-09-06,12:15,,M,1918.0,88.0,White,Speeding,Speeding,False,,Citation,False,16-30 Min,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91637,2015-12-27,09:41,,,,,,,,False,,,,,False
91660,2015-12-28,02:28,,,,,,,,False,,,,,False
91674,2015-12-28,12:01,,,,,,,,False,,,,,False
91710,2015-12-30,13:27,,,,,,,,False,,,,,False
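
Filtering, column selection, and sorting are often combined in one expression. This is a minimal sketch on a made-up frame (the `stops` data is illustrative, not taken from the dataset above):

```python
import pandas as pd

stops = pd.DataFrame({
    'driver_age': [20.0, 40.0, 33.0, 19.0, 36.0],
    'driver_gender': ['M', 'M', 'F', 'M', 'F'],
    'violation': ['Speeding', 'Speeding', 'Equipment', 'Other', 'Speeding'],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
mask = (stops['driver_age'] > 25) & (stops['violation'] == 'Speeding')

# .loc takes the row filter and the column selection in one step
result = (
    stops.loc[mask, ['driver_age', 'driver_gender']]
         .sort_values('driver_age', ascending=False)
)

print(result)
```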


Data Aggregation and Summarization:

Aggregating and summarizing data is crucial for gaining insights. Let’s calculate the mean of a column, the number of distinct values in a column, and the maximum value of a column:

In [16]:
df['driver_age'].mean()

34.011333023687875

In [20]:
# To count the distinct values
df['stop_date'].nunique()

3768

In [21]:
# Maximum value
df['driver_age'].max()

99.0
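
Summaries become more useful when computed per group. The sketch below uses `groupby` with `agg` on a small made-up frame (the `stops` data is an assumption) to get several statistics at once. Note that `count` counts non-missing values, whereas `nunique` counts distinct values:

```python
import pandas as pd

stops = pd.DataFrame({
    'violation': ['Speeding', 'Speeding', 'Equipment', 'Speeding', 'Equipment'],
    'driver_age': [20.0, 40.0, 33.0, None, 23.0],
})

# Several summaries of one column in a single call
overall = stops['driver_age'].agg(['mean', 'count', 'max'])

# The same summaries per group; count ignores the missing age
by_violation = stops.groupby('violation')['driver_age'].agg(['mean', 'count', 'max'])

print(overall)
print(by_violation)
```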

### Data Merging and Joining:

Merging and joining datasets is common when working with multiple data sources. Let’s perform an inner join between two DataFrames:

In [23]:
import pandas as pd

# Sample data for the df1 and df2 DataFrames
data1 = {'Name': ['John', 'Emma'], 'City': ['New York', 'Los Angeles']}
data2 = {'Name': ['John', 'Michael'], 'Age': [25, 31]}



In [24]:
# Creating df1 and df2 DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)



In [25]:
# Merging df1 and df2 on 'Name' column using inner join
merged_df = pd.merge(df1, df2, on='Name', how='inner')
print(merged_df)


   Name      City  Age
0  John  New York   25
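
An inner join keeps only the names present in both DataFrames, which is why Emma and Michael disappear. For comparison, here is a short sketch of a left and an outer join on the same sample data, using `pd.merge`'s `indicator` option to show where each row came from:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Emma'], 'City': ['New York', 'Los Angeles']})
df2 = pd.DataFrame({'Name': ['John', 'Michael'], 'Age': [25, 31]})

# Left join: keep every row of df1; Age is NaN where df2 has no match
left_df = pd.merge(df1, df2, on='Name', how='left')

# Outer join: keep rows from both sides; indicator=True adds a _merge
# column that records whether a row came from the left, right, or both
outer_df = pd.merge(df1, df2, on='Name', how='outer', indicator=True)

print(left_df)
print(outer_df)
```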


### Time Series Analysis:

Pandas provides functionalities for handling time series data. Let’s resample a DataFrame based on a specific time frequency:

In [26]:
df['stop_date'] = pd.to_datetime(df['stop_date'])
df.set_index('stop_date', inplace=True)



In [30]:
# Resample to a monthly frequency and sum each column
resampled_df = df.resample('M').sum()
print(resampled_df.head(2))


                  stop_time  county_name driver_gender  driver_age_raw  \
stop_date                                                                
2005-01-31  01:5508:1523:15          0.0           MMM          5922.0   
2005-02-28            17:15          0.0             M          1986.0   

            driver_age      driver_race             violation_raw  \
stop_date                                                           
2005-01-31        93.0  WhiteWhiteWhite  SpeedingSpeedingSpeeding   
2005-02-28        19.0            White          Call for Service   

                           violation  search_conducted search_type  \
stop_date                                                            
2005-01-31  SpeedingSpeedingSpeeding                 0           0   
2005-02-28                     Other                 0           0   

                        stop_outcome is_arrested             stop_duration  \
stop_date                                                          

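    df.resample('M') groups the rows of df into calendar-month bins ('M' is the month-end frequency; recent pandas releases prefer the alias 'ME').
    .sum() adds up each column within every monthly bin.

After running this code, resampled_df holds one row per month. Summing only makes sense for numeric columns: as the output above shows, text columns such as driver_gender and violation are simply concatenated ('MMM', 'SpeedingSpeedingSpeeding'), so in practice you would select the numeric columns before resampling.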
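
In practice it usually helps to pick specific columns and a suitable aggregation for each. The sketch below does this on a few made-up rows (the `stops` frame is an assumption). It uses the month-start alias `'MS'` rather than `'M'`, so each bin is labeled with the first day of the month instead of the last:

```python
import pandas as pd

stops = pd.DataFrame({
    'stop_date': pd.to_datetime(['2005-01-02', '2005-01-18', '2005-01-23', '2005-02-20']),
    'driver_age': [20.0, 40.0, 33.0, 19.0],
    'is_arrested': [False, False, False, True],
}).set_index('stop_date')

# Group rows by calendar month, then aggregate each column differently:
# average driver age, and number of arrests (True counts as 1)
monthly = stops.resample('MS').agg({'driver_age': 'mean', 'is_arrested': 'sum'})

# size() counts the rows (stops) that fall in each month
stops_per_month = stops.resample('MS').size()

print(monthly)
print(stops_per_month)
```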

In [33]:
# pip install seaborn

In [35]:
# pip install matplotlib

In [36]:
import seaborn as sns
import matplotlib.pyplot as plt


#### Data Visualization with Pandas:

Pandas offers built-in visualization capabilities. Let’s create a line plot and a bar plot using Pandas:

In [40]:
df = pd.read_csv('Sales.csv')

In [41]:
df.head(2)

Unnamed: 0,Region,Country,Item Type,Sales Channel,Order Priority,Order Date,Order ID,Ship Date,Units Sold,Unit Price,Unit Cost,Total Revenue,Total Cost,Total Profit
0,Sub-Saharan Africa,Chad,Office Supplies,Online,L,1/27/2011,292494523,2/12/2011,4484,651.21,524.96,2920025.64,2353920.64,566105.0
1,Europe,Latvia,Beverages,Online,C,12/28/2015,361825549,1/23/2016,1075,47.45,31.79,51008.75,34174.25,16834.5


In [43]:
df.shape

(10000, 14)

In [49]:
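# Plotting all 10,000 rows directly is unreadable, so aggregate first
revenue_by_region = df.groupby('Region')['Total Revenue'].sum()
revenue_by_region.plot(kind='bar')    # bar plot: total revenue per region
plt.show()

# Line plot: total revenue per order year
order_year = pd.to_datetime(df['Order Date']).dt.year
df.groupby(order_year)['Total Revenue'].sum().plot(kind='line')
plt.show()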

Advanced Data Manipulation Techniques:

Pandas supports advanced techniques for data manipulation. Let’s apply a function to a column and handle categorical data:

In [73]:
df1 = pd.read_csv('tips.csv')

In [74]:
df1.head(2)

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3


In [75]:
# Applying a function to a column
df1['total_bill'] = df1['total_bill'].apply(lambda x: x + 10)



In [77]:
df1.head(2)

   total_bill   tip     sex smoker  day    time  size
0       26.99  1.01  Female     No  Sun  Dinner     2
1       20.34  1.66    Male     No  Sun  Dinner     3


In [78]:
df1.describe()

       total_bill         tip        size
count  244.000000  244.000000  244.000000
mean    29.785943    2.998279    2.569672
std      8.902412    1.383638    0.951100
min     13.070000    1.000000    1.000000
25%     23.347500    2.000000    2.000000
50%     27.795000    2.900000    2.000000
75%     34.127500    3.562500    3.000000
max     60.810000   10.000000    6.000000


In [80]:
df1.dtypes

total_bill    float64
tip           float64
sex            object
smoker         object
day            object
time           object
size            int64
dtype: object

In [81]:
# Handling categorical data
df1['smoker'] = df1['smoker'].astype('category')

In [82]:
df1.dtypes

total_bill     float64
tip            float64
sex             object
smoker        category
day             object
time            object
size             int64
dtype: object

    df1['smoker'].astype('category') converts the 'smoker' column to a categorical data type.

After running this code, the 'smoker' column in the DataFrame df1 is of type 'category'. You can verify the change with df1.dtypes, as shown above.
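
The main benefit of the category type is that each distinct label is stored once, with small integer codes pointing to it. Here is a rough sketch on a synthetic column (the `days` values and the repetition count are assumptions chosen for illustration):

```python
import pandas as pd

# Illustrative column with a few repeated labels, like 'smoker' or 'day'
days = pd.Series(['Sun', 'Sat', 'Thur', 'Fri'] * 1000)
days_cat = days.astype('category')

# Compare memory use in bytes before and after the conversion
print(days.memory_usage(deep=True), days_cat.memory_usage(deep=True))

# The distinct labels, and the integer codes that stand in for them
print(list(days_cat.cat.categories))
print(days_cat.cat.codes.head(4).tolist())
```

The saving grows with the number of rows and shrinks as the number of distinct labels approaches the number of rows.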

### Performance Optimization:

For large datasets, optimizing performance is crucial. Let’s leverage vectorized operations and parallel processing to enhance performance:

In [84]:
# Vectorized operations
df1['total_bill'] = df1['total_bill'] * 1.1
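
To see why vectorization matters, compare it with the element-wise `apply` used earlier. This sketch runs both on synthetic data (the size and random seed are arbitrary assumptions) and confirms they produce the same values:

```python
import numpy as np
import pandas as pd

# Synthetic bills; the size and seed are arbitrary
rng = np.random.default_rng(0)
bills = pd.Series(rng.uniform(5, 50, size=100_000))

# Element-wise: calls the Python lambda once per value
with_apply = bills.apply(lambda x: x * 1.1)

# Vectorized: one multiplication over the whole underlying array
vectorized = bills * 1.1

# Both approaches give the same values
print(np.allclose(with_apply, vectorized))
```

The vectorized form is usually much faster, because the loop runs in compiled code rather than in the Python interpreter. Timing both with `%timeit` in a notebook shows the gap on your own machine.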



In [85]:
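# Parallel processing
import multiprocessing

def process_data(chunk):
    # Placeholder: transform one chunk of rows and return it as a DataFrame
    return chunk



In [86]:
# Split the rows into one chunk per CPU core, process the chunks in
# parallel, and concatenate the results back into a single DataFrame.
# Note: on platforms that spawn worker processes (Windows, macOS),
# process_data may need to live in an importable module, not a notebook cell.
n_chunks = multiprocessing.cpu_count()
chunks = [df.iloc[i::n_chunks] for i in range(n_chunks)]
pool = multiprocessing.Pool()
df = pd.concat(pool.map(process_data, chunks)).sort_index()
pool.close()
pool.join()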