# Earthquake Alerts Tweets Dataset Analysis

This dataset contains worldwide earthquake alerts, based on USGS data, for earthquakes of magnitude 1.5 and higher.
It was collected from the public tweets of the Earthquake Alerts Twitter account. The data starts on Feb-16-2023 and is updated up to June-4-2023.

We will clean this dataset with pandas and then draw some useful insights from it using the data visualization tool Power BI Desktop.

In [112]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [113]:
df = pd.read_csv("QuakesToday.csv")

In [114]:
df.head()

Unnamed: 0,Datetime,Tweet Id,Text,Username
0,2023-06-04 22:55:03+00:00,1665492389220487168,"1.6 magnitude #earthquake. 38 km from Valdez, ...",QuakesToday
1,2023-06-04 22:40:03+00:00,1665488613130514432,1.7 magnitude #earthquake. 84 km ESE of Egegik...,QuakesToday
2,2023-06-04 22:23:03+00:00,1665484336316174339,"5.0 magnitude #earthquake. 122 km from Itoman,...",QuakesToday
3,2023-06-04 21:03:03+00:00,1665464203505975296,4.4 magnitude #earthquake. 222 km SSW of Dunhu...,QuakesToday
4,2023-06-04 20:27:03+00:00,1665455143066963968,"4.1 magnitude #earthquake. 8 km from Abricots,...",QuakesToday


In [115]:
df.shape

(13049, 4)

##  Handling the Dataset

As we can see, the 'Text' column holds more than one kind of information.

So, first we have to separate this data into different columns.

To do so, I'll create four new columns, 'Magnitude', 'Area', 'Country' and 'Source', and fill each one with the matching part of the 'Text' column.

- __'Magnitude'__ - Stores the magnitude of the earthquake (a decimal value such as 1.6, kept as text at this stage).
- __'Area'__ - Contains the area in which the earthquake was experienced.
- __'Source'__ - Contains the link of the source from which the data is taken.
- __'Country'__ - Contains the country where earthquake was experienced.

In [116]:
# Make sure 'Text' is a string column before applying .str methods.
df["Text"] = df["Text"].astype(str)

In [117]:
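# Everything before " mag" is the magnitude, e.g. "1.6".
df["Magnitude"] = df["Text"].str.split(" mag").str[0]
# Everything after "quake. " is the location followed by the link.
df["Area"] = df["Text"].str.split("quake. ").str[1]
# Split the link off the location; the "https" prefix is lost in this split.
df["Source"] = df["Area"].str.split("https").str[1]
df["Area"] = df["Area"].str.split("https").str[0]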


Next, I will adjust the data in the newly created columns so that each one holds the right values.

First, I'll split the country off the 'Area' column. Then I'll add 'https' back to each entry in the 'Source' column, since it was dropped while splitting.
Finally, I'll delete every '#' from the data in the 'Country' column.

In [118]:
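# The text after the last comma is the country; the text before the first comma is the area.
df["Country"] = df["Area"].str.rsplit(",", n=1).str[1]
df["Area"] = df["Area"].str.split(",", n=1).str[0]

# Restore the "https" prefix and remove the hashtag marker.
df["Source"] = "https" + df["Source"]
df["Country"] = df["Country"].str.replace("#", "", regex=False)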
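
To see what each split step does, here is a minimal trace on a single tweet. The tweet text is made up for illustration (the real rows are truncated in the preview), so the place name and link are assumptions, not values from the dataset.

```python
import pandas as pd

# Hypothetical tweet text; the real rows are truncated in the preview.
sample = pd.Series(
    ["2.3 magnitude #earthquake. 10 km from Exampletown, #ExampleLand https://t.co/abc123"]
)

magnitude = sample.str.split(" mag").str[0]        # text before " mag"
rest = sample.str.split("quake. ").str[1]          # text after "quake. "
source = "https" + rest.str.split("https").str[1]  # re-attach the prefix lost in the split
place = rest.str.split("https").str[0]             # location part, without the link

# Text after the last comma is the country; text before the first comma is the area.
country = place.str.rsplit(",", n=1).str[1].str.replace("#", "", regex=False)
area = place.str.split(",", n=1).str[0]

print(repr(magnitude[0]), repr(area[0]), repr(country[0]), repr(source[0]))
```

Note that the country keeps the spaces that surrounded it in the tweet, so a `.str.strip()` on that column may be worth adding.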

In [119]:
df.head()

Unnamed: 0,Datetime,Tweet Id,Text,Username,Magnitude,Area,Source,Country
0,2023-06-04 22:55:03+00:00,1665492389220487168,"1.6 magnitude #earthquake. 38 km from Valdez, ...",QuakesToday,1.6,38 km from Valdez,https://t.co/kXHXTDgnNx,UnitedStates
1,2023-06-04 22:40:03+00:00,1665488613130514432,1.7 magnitude #earthquake. 84 km ESE of Egegik...,QuakesToday,1.7,84 km ESE of Egegik,https://t.co/Sdx0vUplys,Alaska
2,2023-06-04 22:23:03+00:00,1665484336316174339,"5.0 magnitude #earthquake. 122 km from Itoman,...",QuakesToday,5.0,122 km from Itoman,https://t.co/lOyz1MVMIE,Japan
3,2023-06-04 21:03:03+00:00,1665464203505975296,4.4 magnitude #earthquake. 222 km SSW of Dunhu...,QuakesToday,4.4,222 km SSW of Dunhuang,https://t.co/PRolgouAiG,China
4,2023-06-04 20:27:03+00:00,1665455143066963968,"4.1 magnitude #earthquake. 8 km from Abricots,...",QuakesToday,4.1,8 km from Abricots,https://t.co/XJta3g29Mu,Haiti


In [120]:
# Parse 'Datetime' once, then separate the date and the time into two columns.

parsed = pd.to_datetime(df["Datetime"])
df["Date"] = parsed.dt.date
df["Time"] = parsed.dt.time

df.head()

Unnamed: 0,Datetime,Tweet Id,Text,Username,Magnitude,Area,Source,Country,Date,Time
0,2023-06-04 22:55:03+00:00,1665492389220487168,"1.6 magnitude #earthquake. 38 km from Valdez, ...",QuakesToday,1.6,38 km from Valdez,https://t.co/kXHXTDgnNx,UnitedStates,2023-06-04,22:55:03
1,2023-06-04 22:40:03+00:00,1665488613130514432,1.7 magnitude #earthquake. 84 km ESE of Egegik...,QuakesToday,1.7,84 km ESE of Egegik,https://t.co/Sdx0vUplys,Alaska,2023-06-04,22:40:03
2,2023-06-04 22:23:03+00:00,1665484336316174339,"5.0 magnitude #earthquake. 122 km from Itoman,...",QuakesToday,5.0,122 km from Itoman,https://t.co/lOyz1MVMIE,Japan,2023-06-04,22:23:03
3,2023-06-04 21:03:03+00:00,1665464203505975296,4.4 magnitude #earthquake. 222 km SSW of Dunhu...,QuakesToday,4.4,222 km SSW of Dunhuang,https://t.co/PRolgouAiG,China,2023-06-04,21:03:03
4,2023-06-04 20:27:03+00:00,1665455143066963968,"4.1 magnitude #earthquake. 8 km from Abricots,...",QuakesToday,4.1,8 km from Abricots,https://t.co/XJta3g29Mu,Haiti,2023-06-04,20:27:03


## Deleting unwanted columns

Now that the new columns are in place, we can delete the columns we no longer need.

So, here I'll drop the "Text", "Username" and "Datetime" columns.

In [121]:
df.drop(["Text","Username","Datetime"], axis=1, inplace=True)

df.head()

Unnamed: 0,Tweet Id,Magnitude,Area,Source,Country,Date,Time
0,1665492389220487168,1.6,38 km from Valdez,https://t.co/kXHXTDgnNx,UnitedStates,2023-06-04,22:55:03
1,1665488613130514432,1.7,84 km ESE of Egegik,https://t.co/Sdx0vUplys,Alaska,2023-06-04,22:40:03
2,1665484336316174339,5.0,122 km from Itoman,https://t.co/lOyz1MVMIE,Japan,2023-06-04,22:23:03
3,1665464203505975296,4.4,222 km SSW of Dunhuang,https://t.co/PRolgouAiG,China,2023-06-04,21:03:03
4,1665455143066963968,4.1,8 km from Abricots,https://t.co/XJta3g29Mu,Haiti,2023-06-04,20:27:03


In [122]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13049 entries, 0 to 13048
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Tweet Id   13049 non-null  int64 
 1   Magnitude  13049 non-null  object
 2   Area       13049 non-null  object
 3   Source     13049 non-null  object
 4   Country    12451 non-null  object
 5   Date       13049 non-null  object
 6   Time       13049 non-null  object
dtypes: int64(1), object(6)
memory usage: 713.7+ KB
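
The summary above lists 'Magnitude' as `object`, meaning the values are still strings. If numeric magnitudes are needed later (for sorting or averaging, say), one possible conversion is sketched below on toy values; the `"n/a"` entry is a made-up example of a value that cannot be parsed.

```python
import pandas as pd

# Toy values standing in for the 'Magnitude' strings; "n/a" is a made-up bad entry.
magnitude = pd.Series(["1.6", "5.0", "n/a"])

# errors="coerce" turns anything unparseable into NaN instead of raising an error.
magnitude_num = pd.to_numeric(magnitude, errors="coerce")

print(magnitude_num.dtype)
print(magnitude_num.isna().sum())
```

On the real frame this would presumably be `df["Magnitude"] = pd.to_numeric(df["Magnitude"], errors="coerce")`, followed by a check for any new NaN values.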


## Treating the missing values

With the columns cleaned up, we can now deal with the missing values in the dataset.

In [123]:
df.isnull().sum()

Tweet Id       0
Magnitude      0
Area           0
Source         0
Country      598
Date           0
Time           0
dtype: int64

In [124]:
df.Country.value_counts()

 United States                        4501
 UnitedStates                         2920
 Alaska                               1820
 Puerto Rico                           776
 PuertoRico                            354
                                      ... 
 Jamaica                                 1
 Guadeloupe                              1
 Bangladesh                              1
 Montenegro                              1
 Democratic Republic of the Congo        1
Name: Country, Length: 118, dtype: int64

In [125]:
# Merging the two spellings of some country names.

df["Country"] = df["Country"].str.replace("UnitedStates", "United States")

In [126]:
df["Country"] = df["Country"].str.replace("PuertoRico", "Puerto Rico")

In [127]:
df["Country"].value_counts()

 United States                        7421
 Alaska                               1820
 Puerto Rico                          1130
 Indonesia                             220
 Japan                                 151
                                      ... 
 Guadeloupe                              1
 Zambia                                  1
 Bangladesh                              1
 Montenegro                              1
 Democratic Republic of the Congo        1
Name: Country, Length: 116, dtype: int64
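
As a sketch of an alternative, several spellings can be merged in one pass with a mapping. The values below are toy data; the leading spaces imitate the entries in the listing above, and the `fixes` dictionary is an illustrative name holding only the two corrections used in this notebook.

```python
import pandas as pd

# Toy values; the leading spaces imitate the entries in the value_counts listing.
country = pd.Series([" UnitedStates", " United States", " PuertoRico", None])

# Illustrative mapping from the run-together spelling to the spaced one.
fixes = {"UnitedStates": "United States", "PuertoRico": "Puerto Rico"}

# Strip surrounding whitespace first so the exact-match replacement can find the keys.
cleaned = country.str.strip().replace(fixes)

print(cleaned.value_counts())
```

Missing values pass through unchanged, so this step does not affect the null count.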

In [128]:
# Counting the missing values in each column.

df.isnull().sum()

Tweet Id       0
Magnitude      0
Area           0
Source         0
Country      598
Date           0
Time           0
dtype: int64

In [106]:
#Calculating percentage of missing values in Country column.

round(100*(df.Country.isnull().sum()/len(df.index)),2)

4.58

Only the 'Country' column contains missing values.

They make up 4.58% of the rows (598 of 13049), which is a small share of the dataset.

So, deleting the rows that have missing values won't cost us much.

In [110]:
# Dropping the rows that have missing values.

df = df[~df["Country"].isnull()]
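
The same two steps, measuring the share of missing values and dropping those rows, can be sketched on a toy frame. `dropna(subset=...)` is shown here as an equivalent spelling of the boolean filter above; the data is invented for illustration.

```python
import pandas as pd

# Toy frame with one missing country out of four rows.
toy = pd.DataFrame({
    "Magnitude": ["1.6", "4.4", "2.0", "3.1"],
    "Country": ["Japan", None, "China", "Haiti"],
})

# isna().mean() is the fraction of missing entries; times 100 gives a percentage.
missing_pct = round(100 * toy["Country"].isna().mean(), 2)

# Keeps the same rows as toy[~toy["Country"].isnull()].
kept = toy.dropna(subset=["Country"])

print(missing_pct, len(kept))
```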

In [108]:
df.isnull().sum()

Tweet Id     0
Magnitude    0
Area         0
Source       0
Country      0
Date         0
Time         0
dtype: int64

In [109]:
round(100*(df.Country.isnull().sum()/len(df.index)),2)

0.0

Hence, the data is now cleaned and no missing values remain.

We have also reshaped the dataset into the columns we wanted.

Now we are ready to start visualizing this dataset.
For that, I'll first export this dataframe to CSV format using 'df.to_csv'.

Then I'll load the CSV file into Power BI Desktop for further visualization.
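
A minimal sketch of the export step, using a toy frame in place of the cleaned one. The output file name `QuakesToday_clean.csv` is an assumption, and the file is written to a temporary directory only to keep the example self-contained.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned DataFrame.
toy = pd.DataFrame({"Magnitude": ["1.6"], "Country": ["Japan"]})

# Hypothetical output name, placed in a temporary directory for this sketch.
out_path = os.path.join(tempfile.mkdtemp(), "QuakesToday_clean.csv")

# index=False keeps the row index out of the file, so no extra unnamed column appears on reload.
toy.to_csv(out_path, index=False)

# Read the file back as text to confirm the columns survived the round trip.
reloaded = pd.read_csv(out_path, dtype=str)
print(reloaded.columns.tolist())
```

On the real frame the call would presumably be `df.to_csv("QuakesToday_clean.csv", index=False)`, with whatever file name suits the Power BI import.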