# Introduction to Pandas

Pandas is an open-source Python library widely used for data manipulation and analysis. Data scientists favor it because it provides powerful, flexible tools for working with structured (tabular) data, similar to spreadsheets or SQL tables. Built on top of the NumPy library, pandas integrates well with other key scientific libraries such as Matplotlib for plotting, SciPy for scientific computing, and scikit-learn for machine learning tasks.

### Key Features of Pandas:
* **Data structures:** Two main objects—Series (one-dimensional labeled array) and DataFrame (two-dimensional table with labeled axes).

* **Data import/export:** Read and write data using common formats (CSV, Excel, SQL, and more).

* **Data cleaning:** Functions for handling missing values, merging/joining datasets, filtering, renaming, and deduplicating.

* **Data analysis:** Built-in methods for aggregation, grouping ("split-apply-combine"), and descriptive statistics.

* **NOTE:** The code below is intended for Google Colab. It mounts your Google Drive so that you can work directly with datasets stored there. This is helpful because free-tier Colab sessions may disconnect or reset at any time, and data stored only in the Colab runtime is lost when that happens. Data and results saved to Google Drive persist even if the runtime is disconnected or restarted. The two lines are commented out here; uncomment them when running in Colab.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [8]:
# Importing pandas library
import pandas as pd

## Pandas Series

A Pandas Series is a one-dimensional labeled array capable of holding data of any type, such as integers, floats, strings, or even Python objects. You can think of it like a single column in a spreadsheet or a database table.

### Key Features:
* **Flexible data:** Can store different types of data (numeric, string, object, etc.).

* **Indexing:** Each element in a Series has an associated label, called an index. Elements can be accessed by label (`.loc`) or by integer position (`.iloc`).

* **Easy creation:** A Series can be created from lists, arrays, dictionaries, or even scalar values using the `pd.Series()` constructor.

* **Built-in methods:** Offers many methods for data manipulation, such as filtering, aggregation, and mathematical operations.

In [16]:
## EXAMPLE-1
# Creating a Series from a list
data = [10, 20, 30, 40]
my_series = pd.Series(data)
print(my_series)

0    10
1    20
2    30
3    40
dtype: int64


You can also provide custom index labels:

In [17]:
## EXAMPLE-2
my_series = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(my_series)


a    10
b    20
c    30
dtype: int64
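
The label-based and integer-based access described in the key features can be sketched as follows. This is a minimal illustration that reuses the Series from EXAMPLE-2; the names `by_label`, `by_position`, and `from_dict`, and the dictionary values, are assumptions made for this sketch.

```python
import pandas as pd

# Same Series as in EXAMPLE-2
my_series = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = my_series.loc['b']       # label-based access
by_position = my_series.iloc[1]     # integer (position-based) access
print(by_label, by_position)

# A Series can also be built from a dictionary; the keys become the index
from_dict = pd.Series({'x': 1.5, 'y': 2.5})
print(from_dict.index.tolist())
```

Both lookups refer to the same element here, because the label `'b'` sits at position 1.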


## Pandas DataFrame

A Pandas DataFrame is a two-dimensional, labeled data structure in Python that resembles a table with rows and columns, similar to a spreadsheet or an SQL database table. It is one of the core data structures in the pandas library, widely used for data manipulation and analysis.

### Key Features of DataFrame:
* **Shape:** Two-dimensional – stores data in rows and columns.

* **Flexible data types:** Each column can hold different types of data (int, float, string, etc.).

* **Labeled axes:** Rows and columns are both labeled, making data selection and manipulation intuitive.

* **Size mutable:** Can easily add or drop columns and rows.

* **Powerful operations:** Supports filtering, sorting, grouping, merging, aggregation, arithmetic operations, and much more.

* **Handles missing data:** Has built-in support for detecting and handling missing values.
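
A few of the features above (size mutability and missing-data handling) can be sketched on a tiny DataFrame. The `df_small` name and the `City` column with its values are hypothetical and exist only for illustration.

```python
import pandas as pd

df_small = pd.DataFrame({'Name': ['Tom', 'Anna'], 'Age': [28, 24]})

# Size mutable: add a column (None marks a missing value)
df_small['City'] = ['Paris', None]

# Missing-data support: count the missing entries in the new column
missing_in_city = int(df_small['City'].isnull().sum())
print(missing_in_city)

# Drop the column again; drop() returns a new DataFrame
df_small = df_small.drop(columns=['City'])
print(df_small.columns.tolist())
```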

In [21]:
## EXAMPLE
data = {'Name': ['Tom', 'Anna'], 'Age': [28, 24]}
df = pd.DataFrame(data)
df

| | Name | Age |
|---|---|---|
| 0 | Tom | 28 |
| 1 | Anna | 24 |


Now let's learn some basic and commonly used pandas functions that are essential for working with data in Series and DataFrames.

In [23]:
# Reading a CSV file into a DataFrame
# Make sure to replace "database.csv" with the actual path to your CSV file
df = pd.read_csv("database.csv")
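
If no CSV file is at hand, `pd.read_csv` can be tried out on an in-memory string instead. The sketch below assumes a three-column excerpt (the `csv_text` and `demo_df` names are hypothetical) and wraps the text in `io.StringIO`, which `read_csv` accepts in place of a file path.

```python
import io
import pandas as pd

# A small CSV held in memory instead of a file on disk
csv_text = """Date,Magnitude,Status
01/02/1965,6.0,Automatic
01/04/1965,5.8,Automatic
"""

demo_df = pd.read_csv(io.StringIO(csv_text))
print(demo_df.shape)
print(demo_df.dtypes)
```

The first line of the text is used as the header row, and `Magnitude` is parsed as a floating-point column.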

`df.head()` returns the first `n` rows of a DataFrame or Series (5 by default).

In [24]:
df.head()  # Display the first few rows of the DataFrame

Unnamed: 0,Date,Time,Latitude,Longitude,Type,Depth,Depth Error,Depth Seismic Stations,Magnitude,Magnitude Type,...,Magnitude Seismic Stations,Azimuthal Gap,Horizontal Distance,Horizontal Error,Root Mean Square,ID,Source,Location Source,Magnitude Source,Status
0,01/02/1965,13:44:18,19.246,145.616,Earthquake,131.6,,,6.0,MW,...,,,,,,ISCGEM860706,ISCGEM,ISCGEM,ISCGEM,Automatic
1,01/04/1965,11:29:49,1.863,127.352,Earthquake,80.0,,,5.8,MW,...,,,,,,ISCGEM860737,ISCGEM,ISCGEM,ISCGEM,Automatic
2,01/05/1965,18:05:58,-20.579,-173.972,Earthquake,20.0,,,6.2,MW,...,,,,,,ISCGEM860762,ISCGEM,ISCGEM,ISCGEM,Automatic
3,01/08/1965,18:49:43,-59.076,-23.557,Earthquake,15.0,,,5.8,MW,...,,,,,,ISCGEM860856,ISCGEM,ISCGEM,ISCGEM,Automatic
4,01/09/1965,13:32:50,11.938,126.427,Earthquake,15.0,,,5.8,MW,...,,,,,,ISCGEM860890,ISCGEM,ISCGEM,ISCGEM,Automatic


`df.tail()` returns the last `n` rows of a DataFrame or Series (5 by default).

In [25]:
df.tail()  # Display the last few rows of the DataFrame

Unnamed: 0,Date,Time,Latitude,Longitude,Type,Depth,Depth Error,Depth Seismic Stations,Magnitude,Magnitude Type,...,Magnitude Seismic Stations,Azimuthal Gap,Horizontal Distance,Horizontal Error,Root Mean Square,ID,Source,Location Source,Magnitude Source,Status
23407,12/28/2016,08:22:12,38.3917,-118.8941,Earthquake,12.3,1.2,40.0,5.6,ML,...,18.0,42.47,0.12,,0.1898,NN00570710,NN,NN,NN,Reviewed
23408,12/28/2016,09:13:47,38.3777,-118.8957,Earthquake,8.8,2.0,33.0,5.5,ML,...,18.0,48.58,0.129,,0.2187,NN00570744,NN,NN,NN,Reviewed
23409,12/28/2016,12:38:51,36.9179,140.4262,Earthquake,10.0,1.8,,5.9,MWW,...,,91.0,0.992,4.8,1.52,US10007NAF,US,US,US,Reviewed
23410,12/29/2016,22:30:19,-9.0283,118.6639,Earthquake,79.0,1.8,,6.3,MWW,...,,26.0,3.553,6.0,1.43,US10007NL0,US,US,US,Reviewed
23411,12/30/2016,20:08:28,37.3973,141.4103,Earthquake,11.94,2.2,,5.5,MB,...,428.0,97.0,0.681,4.5,0.91,US10007NTD,US,US,US,Reviewed


`df.sample()` returns a random sample of `n` rows from a DataFrame or Series (1 by default).

In [26]:
df.sample(5)  # Display a random sample of 5 rows from the DataFrame

Unnamed: 0,Date,Time,Latitude,Longitude,Type,Depth,Depth Error,Depth Seismic Stations,Magnitude,Magnitude Type,...,Magnitude Seismic Stations,Azimuthal Gap,Horizontal Distance,Horizontal Error,Root Mean Square,ID,Source,Location Source,Magnitude Source,Status
13075,06/14/1996,15:04:41,12.811,125.055,Earthquake,28.8,,,6.1,MWC,...,,,,,1.2,USP0007JRM,US,US,HRV,Reviewed
14342,04/11/1999,16:50:39,-6.0,148.495,Earthquake,58.3,5.2,,6.0,MWC,...,,,,,0.98,USP000964K,US,US,HRV,Reviewed
16015,10/16/2002,14:13:13,-15.676,-173.048,Earthquake,33.0,,256.0,6.0,MWB,...,,,,,0.99,USP000BEP9,US,US,US,Reviewed
13087,06/28/1996,02:41:13,-21.712,-175.213,Earthquake,35.6,,,5.6,MWC,...,,,,,0.8,USP0007KCR,US,US,HRV,Reviewed
13137,08/10/1996,00:20:50,-3.828,151.157,Earthquake,33.0,,,5.7,MWC,...,,,,,1.1,USP0007NB8,US,US,HRV,Reviewed
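
The `n` argument of `head()`, `tail()`, and `sample()` can be sketched on a small stand-in DataFrame. The `toy` frame is hypothetical; `random_state` is an optional `sample()` parameter that fixes the seed so the same rows are drawn on every run.

```python
import pandas as pd

# Hypothetical stand-in data with ten rows, indexed 0..9
toy = pd.DataFrame({'value': range(10)})

first_three = toy.head(3)     # rows 0, 1, 2
last_two = toy.tail(2)        # rows 8, 9

# A fixed random_state makes the random sample reproducible
picked = toy.sample(n=4, random_state=0)

print(first_three.index.tolist())
print(last_two.index.tolist())
print(len(picked))
```

Without `random_state`, each call to `sample()` may return different rows, which is why the sampled rows in a notebook output change between runs.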


`df.info()` prints a concise summary of the DataFrame, including column dtypes, non-null counts, and memory usage.

In [27]:
df.info()  # Display a concise summary of the DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23412 entries, 0 to 23411
Data columns (total 21 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Date                        23412 non-null  object 
 1   Time                        23412 non-null  object 
 2   Latitude                    23412 non-null  float64
 3   Longitude                   23412 non-null  float64
 4   Type                        23412 non-null  object 
 5   Depth                       23412 non-null  float64
 6   Depth Error                 4461 non-null   float64
 7   Depth Seismic Stations      7097 non-null   float64
 8   Magnitude                   23412 non-null  float64
 9   Magnitude Type              23409 non-null  object 
 10  Magnitude Error             327 non-null    float64
 11  Magnitude Seismic Stations  2564 non-null   float64
 12  Azimuthal Gap               7299 non-null   float64
 13  Horizontal Distance         160

`df.describe()` generates summary statistics (count, mean, standard deviation, min, quartiles, and max) for all numeric columns of the DataFrame.

In [28]:
df.describe()  # Generate summary statistics for numeric columns

Unnamed: 0,Latitude,Longitude,Depth,Depth Error,Depth Seismic Stations,Magnitude,Magnitude Error,Magnitude Seismic Stations,Azimuthal Gap,Horizontal Distance,Horizontal Error,Root Mean Square
count,23412.0,23412.0,23412.0,4461.0,7097.0,23412.0,327.0,2564.0,7299.0,1604.0,1156.0,17352.0
mean,1.679033,39.639961,70.767911,4.993115,275.364098,5.882531,0.07182,48.944618,44.163532,3.99266,7.662759,1.022784
std,30.113183,125.511959,122.651898,4.875184,162.141631,0.423066,0.051466,62.943106,32.141486,5.377262,10.430396,0.188545
min,-77.08,-179.997,-1.1,0.0,0.0,5.5,0.0,0.0,0.0,0.004505,0.085,0.0
25%,-18.653,-76.34975,14.5225,1.8,146.0,5.6,0.046,10.0,24.1,0.96875,5.3,0.9
50%,-3.5685,103.982,33.0,3.5,255.0,5.7,0.059,28.0,36.0,2.3195,6.7,1.0
75%,26.19075,145.02625,54.0,6.3,384.0,6.0,0.0755,66.0,54.0,4.7245,8.1,1.13
max,86.005,179.998,700.0,91.295,934.0,9.1,0.41,821.0,360.0,37.874,99.0,3.44
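
The behaviour of `describe()` is easier to check by hand on a tiny DataFrame. The sketch below uses a hypothetical `toy` frame with one numeric and one text column; `include='all'` is an optional argument that adds non-numeric columns to the summary.

```python
import pandas as pd

toy = pd.DataFrame({
    'depth': [10.0, 20.0, 30.0, None],   # one missing value
    'status': ['Reviewed', 'Reviewed', 'Automatic', 'Reviewed'],
})

stats = toy.describe()            # numeric columns only by default
print(stats.columns.tolist())     # the text column is left out

# count ignores the missing value, so it is 3 rather than 4
print(stats.loc['count', 'depth'], stats.loc['mean', 'depth'])

# include='all' also summarises non-numeric columns
all_stats = toy.describe(include='all')
print(all_stats.columns.tolist())
```

This also explains why the `count` row in a `describe()` output can differ from column to column: it only counts non-missing values.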


`df.columns` returns an Index object containing the column names, which can be useful for inspecting, renaming, or managing columns in your data.

In [29]:
df.columns  # Display the column names of the DataFrame

Index(['Date', 'Time', 'Latitude', 'Longitude', 'Type', 'Depth', 'Depth Error',
       'Depth Seismic Stations', 'Magnitude', 'Magnitude Type',
       'Magnitude Error', 'Magnitude Seismic Stations', 'Azimuthal Gap',
       'Horizontal Distance', 'Horizontal Error', 'Root Mean Square', 'ID',
       'Source', 'Location Source', 'Magnitude Source', 'Status'],
      dtype='object')

`df.shape` returns the dimensions of the DataFrame as a tuple `(rows, columns)`.

In [30]:
df.shape

(23412, 21)

`df.size` returns the total number of elements in a DataFrame or Series.

In [31]:
df.size

491652
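
For a DataFrame, `size` is simply the product of the two numbers in `shape`. A minimal sketch on a hypothetical `toy` frame:

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

rows, cols = toy.shape     # shape is a (rows, columns) tuple
print(rows, cols)
print(toy.size)            # total number of cells: rows * cols
print(len(toy))            # len() gives the number of rows only
```

The same relation holds for the outputs shown above: 23412 rows × 21 columns = 491652 elements.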

`value_counts()` counts how many times each unique value occurs in a Series or in a column of a DataFrame.

In [32]:
## EXAMPLE
df['Status'].value_counts()

Status
Reviewed     20773
Automatic     2639
Name: count, dtype: int64
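
Two optional arguments of `value_counts()` are often useful: `normalize=True` returns fractions instead of counts, and `dropna=False` keeps missing values in the tally. The sketch below uses a hypothetical four-element `status` Series.

```python
import pandas as pd

status = pd.Series(['Reviewed', 'Reviewed', 'Automatic', None])

counts = status.value_counts()                   # missing values skipped by default
with_missing = status.value_counts(dropna=False) # missing values counted too
shares = status.value_counts(normalize=True)     # fractions that sum to 1

print(counts)
print(with_missing)
print(shares)
```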

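`df.isnull()` detects missing values and returns a boolean DataFrame of the same shape (a boolean Series when called on a Series), with `True` marking each missing entry.

In [None]:
df.isnull()  # Boolean DataFrame: True where a value is missing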

Unnamed: 0,Date,Time,Latitude,Longitude,Type,Depth,Depth Error,Depth Seismic Stations,Magnitude,Magnitude Type,...,Magnitude Seismic Stations,Azimuthal Gap,Horizontal Distance,Horizontal Error,Root Mean Square,ID,Source,Location Source,Magnitude Source,Status
0,False,False,False,False,False,False,True,True,False,False,...,True,True,True,True,True,False,False,False,False,False
1,False,False,False,False,False,False,True,True,False,False,...,True,True,True,True,True,False,False,False,False,False
2,False,False,False,False,False,False,True,True,False,False,...,True,True,True,True,True,False,False,False,False,False
3,False,False,False,False,False,False,True,True,False,False,...,True,True,True,True,True,False,False,False,False,False
4,False,False,False,False,False,False,True,True,False,False,...,True,True,True,True,True,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23407,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
23408,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
23409,False,False,False,False,False,False,False,True,False,False,...,True,False,False,False,False,False,False,False,False,False
23410,False,False,False,False,False,False,False,True,False,False,...,True,False,False,False,False,False,False,False,False,False


`df.isnull().sum()` counts the number of missing (NaN) values in each column of a DataFrame.

In [34]:
df.isnull().sum()  # Count the number of missing values in each column

Date                              0
Time                              0
Latitude                          0
Longitude                         0
Type                              0
Depth                             0
Depth Error                   18951
Depth Seismic Stations        16315
Magnitude                         0
Magnitude Type                    3
Magnitude Error               23085
Magnitude Seismic Stations    20848
Azimuthal Gap                 16113
Horizontal Distance           21808
Horizontal Error              22256
Root Mean Square               6060
ID                                0
Source                            0
Location Source                   0
Magnitude Source                  0
Status                            0
dtype: int64
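
The reason `isnull().sum()` yields counts is that `True` is treated as 1 and `False` as 0 when summing. A minimal sketch on a hypothetical `toy` frame, which also uses `mean()` on the same mask to get the fraction of missing values per column:

```python
import pandas as pd

toy = pd.DataFrame({
    'Depth': [131.6, 80.0, 20.0],
    'Depth Error': [None, None, 1.2],   # two missing values
})

mask = toy.isnull()                  # boolean DataFrame, same shape as toy
missing_per_column = mask.sum()      # True counts as 1, False as 0
missing_share = mask.mean()          # fraction of missing values per column

print(missing_per_column)
print(missing_share)
```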

`df.notnull().sum()` counts the number of non-missing values in each column of a DataFrame.

In [35]:
df.notnull().sum()  # Count the number of non-missing values in each column

Date                          23412
Time                          23412
Latitude                      23412
Longitude                     23412
Type                          23412
Depth                         23412
Depth Error                    4461
Depth Seismic Stations         7097
Magnitude                     23412
Magnitude Type                23409
Magnitude Error                 327
Magnitude Seismic Stations     2564
Azimuthal Gap                  7299
Horizontal Distance            1604
Horizontal Error               1156
Root Mean Square              17352
ID                            23412
Source                        23412
Location Source               23412
Magnitude Source              23412
Status                        23412
dtype: int64

* **NOTE:** In practice, `isnull()` is the usual choice for counting missing values, and `notnull()` is needed less often. It is still good practice to know both, because `notnull()` is handy in some situations, for example when keeping only the rows where a value is present.
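
One situation where `notnull()` is convenient is row filtering. The sketch below assumes a hypothetical `toy` frame and keeps only the rows whose `Magnitude Error` is present.

```python
import pandas as pd

toy = pd.DataFrame({
    'ID': ['A', 'B', 'C'],
    'Magnitude Error': [0.05, None, 0.07],
})

# Boolean mask: True where the value is present
has_error = toy['Magnitude Error'].notnull()

# Use the mask to keep only the rows with a value
with_error = toy[has_error]
print(with_error['ID'].tolist())

# isnull() and notnull() are complements, so their counts add up to the row count
print(toy['Magnitude Error'].isnull().sum() + has_error.sum())
```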