# Author: Srikar Kalle  

**Student ID: C00313529**

**Date: December 13, 2024**

**Project: Earthquake Insights Analysis**

In [1]:
import pandas as pd
from sklearn.cluster import KMeans

In [2]:
df = pd.read_csv('quakes-cleaned.csv')

# Insight 1: Most seismic activity by region

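The topRegions Series lists the 25 places with the most recorded earthquakes. It is built by counting the events for each value of the place column and keeping the 25 most frequent. Note that place is a free-text description relative to a nearby town (for example, "16 km NE of Milford, Utah"), so epicentres at slightly different distances from the same town are counted as separate places.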

In [3]:
topRegions = df['place'].value_counts().head(25)

In [4]:
topRegions

place
16 km NE of Milford, Utah         117
22 km NNE of Yerington, Nevada    110
66 km WNW of Beluga, Alaska       107
21 km NNE of Yerington, Nevada    107
65 km WNW of Beluga, Alaska        86
23 km NNE of Yerington, Nevada     83
8 km NNW of The Geysers, CA        75
15 km NE of Milford, Utah          71
20 km NNE of Yerington, Nevada     67
9 km NW of The Geysers, CA         65
7 km NW of The Geysers, CA         61
7 km WNW of Cobb, CA               61
67 km WNW of Beluga, Alaska        55
10 km NW of The Geysers, CA        53
8 km WNW of Cobb, CA               52
7 km NNW of The Geysers, CA        48
24 km NNE of Yerington, Nevada     39
6 km WNW of Cobb, CA               39
64 km WNW of Tyonek, Alaska        38
9 km WNW of Cobb, CA               36
6 km NNW of The Geysers, CA        35
2 km NNW of The Geysers, CA        33
66 km WNW of Tyonek, Alaska        32
67 km WNW of Tyonek, Alaska        31
6 km NW of The Geysers, CA         30
Name: count, dtype: int64
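
Because many of the entries above differ only in distance or bearing from the same town, it can help to count by the nearby locality instead. The following is a minimal sketch, not part of the original analysis: the regular expression and the sample rows are assumptions based on the place strings shown in the output, and strings that do not match the "N km DIR of" prefix are left unchanged.

```python
import pandas as pd

# Illustrative sample copied from the place values shown above
places = pd.Series([
    "16 km NE of Milford, Utah",
    "15 km NE of Milford, Utah",
    "22 km NNE of Yerington, Nevada",
    "8 km NNW of The Geysers, CA",
    "7 km NW of The Geysers, CA",
    "7 km WNW of Cobb, CA",
])

# Assumed prefix format: "<distance> km <compass bearing> of "
prefix = r"^\d+(?:\.\d+)?\s*km\s+[NSEW]{1,3}\s+of\s+"

# Strip the distance/bearing prefix, leaving only the reference locality
locality = places.str.replace(prefix, "", regex=True)

# Count events per locality rather than per exact place string
locality_counts = locality.value_counts()
print(locality_counts)
```

On the full dataset the same idea would be applied to df['place'] before calling value_counts().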

# Insight 2: Most active time of day

In [5]:
df['time'] = pd.to_datetime(df['time'], errors='coerce')

if df['time'].isnull().any():
    print("Warning: Some rows have invalid time entries and were set to NaT.")



In [6]:
df = df.dropna(subset=['time'])

print(f"Number of rows with invalid 'time' after drop: {df['time'].isnull().sum()}")

Number of rows with invalid 'time' after drop: 0


In [7]:
df['hour'] = df['time'].dt.hour

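Categorizing earthquake activity by time of day: each earthquake in the dataset has a timestamp. To see how events are distributed through the day, the hour of occurrence is binned into four periods: Night, Morning, Afternoon, and Evening. Because pd.cut uses right-closed intervals by default, the bin edges 0, 6, 12, 18, 24 map the hours as follows:

Hours 1 to 6: Night

Hours 7 to 12: Morning

Hours 13 to 18: Afternoon

Hours 19 to 23: Evening

Hour 0 sits on the excluded left edge of the first bin, so events in that hour are left uncategorized (NaN) and are not included in the counts.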

In [8]:
timeOfDay = pd.cut(df['hour'], bins=[0, 6, 12, 18, 24], labels=['Night', 'Morning', 'Afternoon', 'Evening'])
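
By default, pd.cut treats each bin as open on the left and closed on the right, and it does not include the lowest edge. The sketch below, using a handful of made-up hours, compares that default with right=False, which yields the half-open bins [0, 6), [6, 12), [12, 18), [18, 24). It is an illustration of the parameter, not a rerun of the analysis.

```python
import pandas as pd

# Hypothetical hours chosen to sit on the bin edges
hours = pd.Series([0, 5, 6, 12, 18, 23])
edges = [0, 6, 12, 18, 24]
labels = ["Night", "Morning", "Afternoon", "Evening"]

# Default behaviour: intervals (0, 6], (6, 12], (12, 18], (18, 24]
default_bins = pd.cut(hours, bins=edges, labels=labels)

# Left-closed intervals: [0, 6), [6, 12), [12, 18), [18, 24)
left_closed = pd.cut(hours, bins=edges, labels=labels, right=False)

comparison = pd.DataFrame({
    "hour": hours,
    "default": default_bins,
    "right=False": left_closed,
})
print(comparison)
```

With right=False every hour from 0 to 23 receives a label, and hour 6 is counted as Morning rather than Night.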

Count the occurrences in each time-of-day bin.

In [9]:
mostActiveTime = timeOfDay.value_counts()

In [10]:
mostActiveTime

hour
Morning      2621
Night        2537
Afternoon    2414
Evening      2056
Name: count, dtype: int64

# Insight 3: Detect patterns or anomalies in earthquake occurrences over the study timeframe.

The dailyCounts DataFrame holds the number of earthquakes recorded on each day, which helps reveal patterns or anomalies over time. Grouping the data by date and counting the events per day makes it possible to observe:

Patterns: A sustained increase or decrease in daily counts over a period may point to seasonal, environmental, or geological factors affecting seismic activity.

Anomalies: Sudden spikes or drops on particular dates may correspond to specific seismic events or individual causes such as natural disasters, tectonic shifts, or other geological activity.

This daily breakdown gives a clear view of seismic activity over time and helps identify days with higher or lower activity than usual. Plotting the counts makes these patterns and anomalies easy to spot visually.

In [11]:
df['date'] = pd.to_datetime(df['time']).dt.date
dailyCounts = df.groupby('date').size().reset_index(name='count')
dailyCounts

| | date | count |
|---|---|---|
| 0 | 2024-11-30 | 77 |
| 1 | 2024-12-01 | 350 |
| 2 | 2024-12-02 | 352 |
| 3 | 2024-12-03 | 372 |
| 4 | 2024-12-04 | 371 |
| 5 | 2024-12-05 | 616 |
| 6 | 2024-12-06 | 440 |
| 7 | 2024-12-07 | 392 |
| 8 | 2024-12-08 | 389 |
| 9 | 2024-12-09 | 576 |
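
One way to turn "higher or lower than usual" into a concrete rule is a robust z-score based on the median and the median absolute deviation (MAD). The sketch below applies it to the ten daily counts shown above. The 3.5 cut-off is a commonly used convention and an assumption here, not something taken from the original analysis, and a low count on the first day may simply reflect a partial day in the data window.

```python
import pandas as pd

# Daily counts copied from the dailyCounts output above
daily = pd.Series(
    [77, 350, 352, 372, 371, 616, 440, 392, 389, 576],
    index=pd.date_range("2024-11-30", periods=10, freq="D"),
    name="count",
)

THRESHOLD = 3.5  # assumed cut-off for the robust z-score

# Median and median absolute deviation are less sensitive to outliers
# than the mean and standard deviation
median = daily.median()
mad = (daily - median).abs().median()

# 0.6745 rescales the MAD so the score is comparable to a standard z-score
robust_z = 0.6745 * (daily - median) / mad

# Keep only the days whose score exceeds the threshold in either direction
anomalies = daily[robust_z.abs() > THRESHOLD]
print(anomalies)
```

For these ten days the rule flags 2024-11-30, 2024-12-05, and 2024-12-09.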


# Insight 4: Analyze magnitude statistics for top regions.

The regionMag DataFrame ranks the top 35 places by average earthquake magnitude. It groups the data by the place column, computes the mean magnitude (mag) for each place, sorts the means in descending order, and keeps the first 35.

In [12]:
regionMag = df.groupby('place')['mag'].mean().sort_values(ascending=False).head(35).reset_index()

In [13]:
regionMag

| | place | mag |
|---|---|---|
| 0 | 2024 Offshore Cape Mendocino, California Earth... | 7.0 |
| 1 | 30 km W of Port-Vila, Vanuatu | 6.7 |
| 2 | 56 km ESE of Molina, Chile | 6.4 |
| 3 | 37 km S of Guisa, Cuba | 5.9 |
| 4 | 26 km SE of Tinogasta, Argentina | 5.9 |
| 5 | 136 km W of Neiafu, Tonga | 5.9 |
| 6 | 2024 Parker Butte, Nevada Earthquake | 5.7 |
| 7 | 115 km SW of Adak, Alaska | 5.7 |
| 8 | 9 km S of Conchagua, El Salvador | 5.6 |
| 9 | 255 km E of Levuka, Fiji | 5.6 |
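
A mean taken over a single event is just that event's magnitude, so a ranking by mean can be dominated by places that appear only once. The sketch below shows one possible refinement, using made-up places and magnitudes: it reports the event count alongside the mean and optionally requires a minimum number of events. MIN_EVENTS is a hypothetical parameter, not part of the original analysis.

```python
import pandas as pd

# Hypothetical data: place names and magnitudes are illustrative only
quakes = pd.DataFrame({
    "place": ["A", "A", "A", "B", "C", "C"],
    "mag": [1.0, 2.0, 3.0, 6.5, 4.0, 5.0],
})

MIN_EVENTS = 2  # assumed minimum number of events for a place to be ranked

# Mean, maximum, and number of events per place
stats = quakes.groupby("place")["mag"].agg(["mean", "max", "count"])

# Drop places with too few events, then rank by mean magnitude
ranked = (
    stats[stats["count"] >= MIN_EVENTS]
    .sort_values("mean", ascending=False)
)
print(ranked)
```

Here place B has the highest mean but only one event, so it is excluded from the ranking.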


In [14]:
regionMag = (
    df.groupby('place')['mag']
    .mean()
    .sort_values(ascending=False)
    .head(35)
    .reset_index()
)

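# Keep the first recorded latitude/longitude for each place
coordinates = df[['place', 'latitude', 'longitude']].drop_duplicates(subset='place')
# Attach the coordinates to the ranked places
regionMag = regionMag.merge(coordinates, on='place', how='left')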
regionMag

| | place | mag | latitude | longitude |
|---|---|---|---|---|
| 0 | 2024 Offshore Cape Mendocino, California Earth... | 7.0 | 40.374 | -125.021667 |
| 1 | 30 km W of Port-Vila, Vanuatu | 6.7 | -17.7098 | 168.0292 |
| 2 | 56 km ESE of Molina, Chile | 6.4 | -35.3392 | -70.7315 |
| 3 | 37 km S of Guisa, Cuba | 5.9 | 19.9162 | -76.5153 |
| 4 | 26 km SE of Tinogasta, Argentina | 5.9 | -28.2378 | -67.3792 |
| 5 | 136 km W of Neiafu, Tonga | 5.9 | -18.5653 | -175.2725 |
| 6 | 2024 Parker Butte, Nevada Earthquake | 5.7 | 39.1675 | -119.0238 |
| 7 | 115 km SW of Adak, Alaska | 5.7 | 51.0239 | -177.5884 |
| 8 | 9 km S of Conchagua, El Salvador | 5.6 | 13.2206 | -87.875 |
| 9 | 255 km E of Levuka, Fiji | 5.6 | -18.1647 | -178.274 |
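
The merge above attaches the first coordinates recorded for each place. As an alternative sketch, a single groupby with named aggregation can compute the mean magnitude and a representative coordinate in one step. The sample rows are made up, and averaging coordinates is an assumption that only makes sense when a place's events are close together and away from the ±180° meridian.

```python
import pandas as pd

# Hypothetical data: places and coordinates are illustrative only
quakes = pd.DataFrame({
    "place": ["P", "P", "Q"],
    "mag": [2.0, 4.0, 5.0],
    "latitude": [10.0, 12.0, -20.0],
    "longitude": [100.0, 102.0, 50.0],
})

# Named aggregation: one output column per (source column, function) pair
summary = (
    quakes.groupby("place")
    .agg(
        mag=("mag", "mean"),
        latitude=("latitude", "mean"),
        longitude=("longitude", "mean"),
        events=("mag", "size"),
    )
    .sort_values("mag", ascending=False)
    .reset_index()
)
print(summary)
```

This avoids the separate drop_duplicates and merge steps, at the cost of reporting an averaged rather than an observed location.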


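# Insight 5: Weekly Earthquake Activity by Day

This insight arranges the daily earthquake counts by ISO week, which helps highlight trends or anomalies in seismic activity over the study period. The code parses the time column and extracts the calendar date, counts the events per date, assigns each date its ISO week number, and pivots the result into a table with one row per week and one column per day of the month. ISO weeks can cross year boundaries, so December 30, 2024 is assigned to week 1.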

In [15]:
# Parse timestamps and extract the calendar date
df['time'] = pd.to_datetime(df['time'], errors='coerce')
df['date'] = df['time'].dt.date

# Count events per date, in chronological order
daily_activity = df['date'].value_counts().sort_index()
daily_activity.index = pd.to_datetime(daily_activity.index)

daily_activity = daily_activity.reset_index()
daily_activity.columns = ['date', 'count']

# ISO week number (late-December dates can fall in week 1 of the next ISO year)
daily_activity['week'] = daily_activity['date'].dt.isocalendar().week

# Rows: ISO week; columns: day of the month
heatmap_data = daily_activity.pivot_table(index='week', columns=daily_activity['date'].dt.day, values='count', aggfunc='sum')
heatmap_data

| week / date | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 179.0 |
| 48 | 350.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 77.0 |
| 49 | NaN | 352.0 | 372.0 | 371.0 | 616.0 | 440.0 | 392.0 | 389.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 50 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 576.0 | 557.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 51 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 226.0 | 212.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 52 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 279.0 | 225.0 | 199.0 | 242.0 | 284.0 | 325.0 | 230.0 | NaN |
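
In the table above, the week 1 row (December 30) sorts before week 48 because the ISO year is not part of the index, and the columns are days of the month rather than weekdays. The sketch below shows one possible variant that indexes by ISO year and week and uses weekday names as columns. It uses four of the daily counts shown in this notebook, and the column and variable names are assumptions for illustration.

```python
import pandas as pd

# A few daily counts taken from the outputs above
daily = pd.DataFrame({
    "date": pd.to_datetime(["2024-11-30", "2024-12-01", "2024-12-02", "2024-12-30"]),
    "count": [77, 350, 352, 179],
})

# ISO calendar fields; cast to plain ints for simple indexing
iso = daily["date"].dt.isocalendar()
daily["iso_year"] = iso["year"].astype(int)
daily["week"] = iso["week"].astype(int)
daily["weekday"] = daily["date"].dt.day_name()

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Rows: (ISO year, ISO week); columns: weekday, in calendar order
weekly = (
    daily.pivot_table(index=["iso_year", "week"], columns="weekday",
                      values="count", aggfunc="sum")
    .reindex(columns=weekday_order)
)
print(weekly)
```

With the ISO year in the index, December 30, 2024 appears as (2025, 1) after the 2024 weeks instead of at the top of the table.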
