### Data Wrangling Project: Shark Attacks Dataset

#### Instructions

Welcome to the final project of this data wrangling module! In this project, you will work through the entire data wrangling workflow while preparing the shark_attacks.csv file for analysis. This dataset contains very dirty data and will require a lot of work! The project is broken down into the key steps of the data wrangling process to guide you along the way. When you are finished, save the wrangled dataset as final_project.csv. Submit the final project as a zip folder named final_project.zip. Make sure the zipped folder contains both your wrangled dataset and this Word document. Best of luck!

In [1]:
import pandas
import numpy

df = pandas.read_csv("shark_attacks.csv")
df.head(3)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order
0,2017.06.11,2017-06-11,2017.0,Unprovoked,AUSTRALIA,Western Australia,"Point Casuarina, Bunbury",Body boarding,Paul Goff,M,...,N,08h30,"White shark, 4 m","WA Today, 6/11/2017",2017.06.11-Goff.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2017.06.11,2017.06.11,6095
1,2017.06.10.b,2017-06-10,2017.0,Unprovoked,AUSTRALIA,Victoria,"Flinders, Mornington Penisula",Surfing,female,F,...,N,15h45,7 gill shark,,2017.06.10.b-Flinders.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2017.06.10.b,2017.06.10.b,6094
2,2017.06.10.a,2017-06-10,2017.0,Unprovoked,USA,Florida,"Ponce Inlet, Volusia County",Surfing,Bryan Brock,M,...,N,10h00,,"Daytona Beach News-Journal, 6/10/2017",2017.06.10.a-Brock.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2017.06.10.a,2017.06.10.a,6093


#### Step 2: Data Inspection & Step 3: Data Cleaning

Inspect the dataset. In the box below, discuss the following:
1. Are there any irrelevant columns? Which ones?
2. Are there any duplicates?
3. Which columns have missing data? 
4. For each column with missing data, describe what you think the best way to handle that missing data is, and why?
5. Are there any errors? Describe any you find.
6. Is there anything else that requires data cleaning attention? 
(12 marks)
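
A quick way to start answering these questions is a per-column summary. The sketch below is only illustrative: `inspect_frame` is a hypothetical helper name, and `toy` is a made-up frame, not the real shark_attacks.csv.

```python
import pandas

def inspect_frame(frame: pandas.DataFrame) -> pandas.DataFrame:
    """Per-column summary: dtype, missing count, missing share, distinct values."""
    return pandas.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "missing_share": frame.isna().mean().round(4),
        "n_unique": frame.nunique(dropna=True),
    })

# Made-up rows, for illustration only
toy = pandas.DataFrame({
    "Country": ["USA", "USA", None, "AUSTRALIA"],
    "Age": ["19", "19", "teen", None],
})

summary = inspect_frame(toy)
n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
print(summary)
print(f"{n_duplicates} fully duplicated rows")
```

The summary covers questions 2 and 3 directly; irrelevant columns and entry errors still need a manual look at the values.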


--------------

1. Are there any irrelevant columns? Which ones?

In [2]:
# inspect data columns
df.columns
# Ans: Case Number, Year, Name, Injury, Investigator or Source, pdf,
#      href formula, href, Case Number.1, Case Number.2

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order'],
      dtype='object')

In [3]:
# remove irrelevant columns
df1 = df.drop(columns=["Case Number","Year","Name", "Investigator or Source", "pdf",
                "Injury", "href formula", "href", "Case Number.1", "Case Number.2"])
df1.head(3)

Unnamed: 0,Date,Type,Country,Area,Location,Activity,Sex,Age,Fatal (Y/N),Time,Species,original order
0,2017-06-11,Unprovoked,AUSTRALIA,Western Australia,"Point Casuarina, Bunbury",Body boarding,M,48.0,N,08h30,"White shark, 4 m",6095
1,2017-06-10,Unprovoked,AUSTRALIA,Victoria,"Flinders, Mornington Penisula",Surfing,F,,N,15h45,7 gill shark,6094
2,2017-06-10,Unprovoked,USA,Florida,"Ponce Inlet, Volusia County",Surfing,M,19.0,N,10h00,,6093


2. Are there any duplicates?

In [4]:
df1[df1.duplicated()]

Unnamed: 0,Date,Type,Country,Area,Location,Activity,Sex,Age,Fatal (Y/N),Time,Species,original order
662,2012-04-03,Unprovoked,USA,Hawaii,"Leftovers near Chun's Reef, Oahu",Surfing,M,28,N,12h38,"Tiger shark, 10'",5434
958,2009-08-29,Unprovoked,SOUTH AFRICA,Western Cape Province,Glentana,Surfing,M,25,Y,15h30,White shark,5139


In [5]:
# check and remove duplicated rows
print(f"{df1.duplicated().sum()} rows duplicated...")

# drop duplicated rows if any, then report it
df1.drop_duplicates(inplace=True)
print("duplicated rows are removed")

2 rows duplicated...
duplicated rows are removed


3. Which columns have missing data? 

In [6]:
missing_col = pandas.DataFrame([df1.isna().sum(), df1.isna().sum() 
                                /df1.shape[0]], index=["Missing","Missing (%)"]).T
missing_col["Missing (%)"]= missing_col["Missing (%)"].round(4)
missing_col

# Ans: Type, Country, Area, Location, Activity, Sex, Age, Fatal (Y/N), Time, Species
#      (Year, Name and Injury also have missing values, but they were dropped above)

Unnamed: 0,Missing,Missing (%)
Date,0.0,0.0
Type,2.0,0.0017
Country,1.0,0.0009
Area,56.0,0.0484
Location,59.0,0.051
Activity,52.0,0.0449
Sex,43.0,0.0372
Age,289.0,0.2498
Fatal (Y/N),7.0,0.0061
Time,328.0,0.2835


4. For each column with missing data, describe what you think the best way to handle that missing data is, and why?

In [7]:
df2 = df1.copy(deep=True)

In [8]:
missing_col = pandas.DataFrame([df1.isna().sum(), df1.isna().sum() 
                                /df1.shape[0]], index=["Missing","Missing (%)"]).T
missing_col["Missing (%)"]= missing_col["Missing (%)"].round(4)
missing_col

Unnamed: 0,Missing,Missing (%)
Date,0.0,0.0
Type,2.0,0.0017
Country,1.0,0.0009
Area,56.0,0.0484
Location,59.0,0.051
Activity,52.0,0.0449
Sex,43.0,0.0372
Age,289.0,0.2498
Fatal (Y/N),7.0,0.0061
Time,328.0,0.2835


In [9]:
# show the columns where more than 15% of the values are missing
missing_col[missing_col["Missing (%)"] > 0.15]

Unnamed: 0,Missing,Missing (%)
Age,289.0,0.2498
Time,328.0,0.2835
Species,447.0,0.3863


In [10]:
# replace null values with "Unknown" or "Other" value
print("Missing values were filled with 'Unknown' or 'Other'")
df3 = df2.copy(deep=True)
# assign the result back instead of calling fillna(inplace=True) on a column selection
df3["Age"] = df3["Age"].fillna("Unknown")
df3["Species"] = df3["Species"].replace(numpy.nan, "Other")
df3["Time"] = df3["Time"].replace(numpy.nan, "Unknown")

df3["Type"] = df3["Type"].replace(numpy.nan, "Unknown")
df3["Sex"] = df3["Sex"].replace(numpy.nan, "Unknown")
df3["Country"] = df3["Country"].replace(numpy.nan, "Other")
df3["Location"] = df3["Location"].replace(numpy.nan, "Other")
df3["Area"] = df3["Area"].replace(numpy.nan, "Other")
df3["Activity"] = df3["Activity"].replace(numpy.nan, "Unknown")

df3["Fatal (Y/N)"] = df3["Fatal (Y/N)"].replace(numpy.nan, "Unknown")
# normalise the existing upper-case "UNKNOWN" entries to the same placeholder
df3["Fatal (Y/N)"] = df3["Fatal (Y/N)"].replace("UNKNOWN", "Unknown")
df3.head(3)

Missing values were filled with 'Unknown' or 'Other'


Unnamed: 0,Date,Type,Country,Area,Location,Activity,Sex,Age,Fatal (Y/N),Time,Species,original order
0,2017-06-11,Unprovoked,AUSTRALIA,Western Australia,"Point Casuarina, Bunbury",Body boarding,M,48,N,08h30,"White shark, 4 m",6095
1,2017-06-10,Unprovoked,AUSTRALIA,Victoria,"Flinders, Mornington Penisula",Surfing,F,Unknown,N,15h45,7 gill shark,6094
2,2017-06-10,Unprovoked,USA,Florida,"Ponce Inlet, Volusia County",Surfing,M,19,N,10h00,Other,6093


In [11]:
# for any other columns, drop the rows that still contain nulls, since only a few rows would be affected
print(f"Before dropping missing values, total rows is {df3.shape[0]}")
df3.dropna(inplace=True)

print(f"After dropping missing values, total rows is {df3.shape[0]}")
print("No rows are dropped...")

Before dropping missing values, total rows is 1157
After dropping missing values, total rows is 1157
No rows are dropped...
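
The column-by-column replacements above can also be written as a single `fillna` call that takes a dict of placeholder labels. This is a minimal sketch on made-up rows; the `fill_labels` name and the three columns shown are illustrative assumptions, not the full set used in the project.

```python
import numpy
import pandas

# Made-up rows standing in for the real data
toy = pandas.DataFrame({
    "Type": ["Unprovoked", numpy.nan],
    "Country": [numpy.nan, "USA"],
    "Species": ["White shark", numpy.nan],
})

# One placeholder label per column, mirroring the choices made above
fill_labels = {"Type": "Unknown", "Country": "Other", "Species": "Other"}

# fillna accepts a dict mapping column name -> fill value
filled = toy.fillna(value=fill_labels)
print(filled)
```

Keeping the labels in one dict makes the missing-value strategy easy to review and change in a single place.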


5. Are there any errors? Describe any you find.

(i) Inconsistent date entry

In [12]:
df4 = df3.copy(deep=True)

# check inconsistent date entry
df4[df4["Date"].str.contains("Reported")]

Unnamed: 0,Date,Type,Country,Area,Location,Activity,Sex,Age,Fatal (Y/N),Time,Species,original order
3,Reported 07-Jun-2017,Unprovoked,UNITED KINGDOM,South Devon,Bantham Beach,Surfing,M,30,N,Unknown,"3m shark, probably a smooth hound",6092
10,Reported 06-May-2017,Provoked,AUSTRALIA,Queensland,Weipa,Attempting to lasso a shark,M,29,N,Unknown,9' shark,6085
37,Reported 09-Mar-2017,Unprovoked,BAHAMAS,Great Exuma,Other,Washing hands,M,58,N,Unknown,Lemon shark,6058
52,Reported 08-Jan-2017,Invalid,AUSTRALIA,Queensland,Other,Spearfishing,M,35,Unknown,Unknown,Bull shark,6043
113,Reported 14-Jul-2016,Unprovoked,BAHAMAS,Other,Tiger Beach,Scuba Diving,M,Unknown,N,Unknown,"Lemon shark, 9'",5982
...,...,...,...,...,...,...,...,...,...,...,...,...
1133,Reported 19-Apr-2008,Invalid,SOUTH AFRICA,KwaZulu-Natal,Aliwal Shoal,Free-diving,M,Unknown,N,Unknown,"Tiger shark, 13' female",4964
1137,Reported 09-Apr-2008,Unprovoked,FIJI,Other,Other,Spearfishing,M,Unknown,N,Unknown,Other,4960
1139,Reported 08-Apr-2008,Unprovoked,USA,Florida,"1.4 miles south of Ponce de Leon Jetty, New Sm...",Surfing,Unknown,Unknown,N,Unknown,Other,4958
1150,Reported 21-Feb-2008,Unprovoked,FRENCH POLYNESIA,Society Islands,Tahiti,Spearfishing,M,26,N,Unknown,Other,4947


In [13]:
# check inconsistent date entry - continued
# remove the "Reported " prefix from the Date column
df4["Date"] = df4["Date"].str.replace("Reported ", "", regex=False)
# show the rows whose Date still cannot be parsed
df4[pandas.to_datetime(df4["Date"], errors="coerce").isna()]

Unnamed: 0,Date,Type,Country,Area,Location,Activity,Sex,Age,Fatal (Y/N),Time,Species,original order
746,16-Aug--2011,Unprovoked,USA,Puerto Rico,Vieques,Floating,M,27,N,Night,Other,5350
749,11-Aug--2011,Unprovoked,USA,North Carolina,Beaufort Inlet,Swimming,M,54,N,14h00,Other,5347
897,190Feb-2010,Unprovoked,NEW ZEALAND,South Island,Tahunanui Beach,Swimming,M,Unknown,N,Night,Possibly a blue shark,5199
1101,Late Jul-2008,Boat,UNITED KINGDOM,Sussex,"Rock-a-Nore, Hastings",Rowing an inflatable dinghy,M,16,N,Unknown,"Starry smoothhound shark, 1m",4996


In [14]:
# clean the "Date" column
df4["Date"] = df4["Date"].str.replace("--", "-", regex=False)
df4["Date"] = df4["Date"].str.replace("190Feb-2010", "19-Feb-2010", regex=False)
df4.drop(index=1101, inplace=True)  # drop "Late Jul-2008", which has no exact day
# check that no unparseable dates remain
df4[pandas.to_datetime(df4["Date"], errors="coerce").isna()]

Unnamed: 0,Date,Type,Country,Area,Location,Activity,Sex,Age,Fatal (Y/N),Time,Species,original order


In [15]:

df4["Date"] = pandas.to_datetime(df4["Date"])
df4["Date"]

0      2017-06-11
1      2017-06-10
2      2017-06-10
3      2017-06-07
4      2017-06-04
          ...    
1154   2008-01-30
1155   2008-01-29
1156   2008-01-27
1157   2008-01-19
1158   2008-01-14
Name: Date, Length: 1156, dtype: datetime64[ns]
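
The date fixes above can be collected into one reusable function. The sketch below is a hedged illustration, not the notebook's method: `clean_dates` and its `corrections` argument are hypothetical names, the sample strings are copied from the outputs above, and it assumes every entry uses the day-month-year layout with English month abbreviations (the real column also contains ISO dates, which this sketch does not handle).

```python
import pandas

def clean_dates(raw: pandas.Series, corrections: dict) -> pandas.Series:
    """Strip known noise, apply manual corrections, then parse as dd-Mon-yyyy."""
    cleaned = (
        raw.str.replace("Reported ", "", regex=False)  # drop the prefix
           .str.replace("--", "-", regex=False)        # collapse doubled hyphens
           .replace(corrections)                       # exact-value typo fixes
    )
    # Anything that still does not match the format becomes NaT
    return pandas.to_datetime(cleaned, format="%d-%b-%Y", errors="coerce")

raw = pandas.Series(
    ["Reported 07-Jun-2017", "16-Aug--2011", "190Feb-2010", "Late Jul-2008"]
)
parsed = clean_dates(raw, {"190Feb-2010": "19-Feb-2010"})

# Entries that could not be parsed are left for a manual decision
unparsed = raw[parsed.isna()]
print(parsed)
print(unparsed)
```

Listing the unparsed entries before dropping anything keeps the decision about rows such as "Late Jul-2008" explicit.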

(ii) Invalid value in the "Fatal (Y/N)" column

In [16]:
# Note: Fatal (Y/N) should only contain "Y" or "N" (plus the "Unknown" placeholder added above)
df5 = df4.copy(deep=True)
df5["Fatal (Y/N)"].unique()

array(['N', 'Y', 'Unknown', '2017'], dtype=object)

In [17]:
# show the rows of df5 where the "Fatal (Y/N)" column contains values other than "Y", "N" and "Unknown"
df5[~ df5["Fatal (Y/N)"].isin(["Y","N", "Unknown"])]

Unnamed: 0,Date,Type,Country,Area,Location,Activity,Sex,Age,Fatal (Y/N),Time,Species,original order
646,2012-06-10,Provoked,ITALY,Sardinia,Muravera,Attempting to rescue an injured & beached shark,M,57,2017,Morning,"Blue shark, 2.5m",5449


In [18]:
# Remove rows from df5 where the "Fatal (Y/N)" column contains values other than "Y", "N" and "Unknown"
df6 = df5.drop(index=df5[~ df5["Fatal (Y/N)"].isin(["Y","N", "Unknown"])].index)
print(f"Before excluding rows where the Fatal (Y/N) column contains values other than 'Y', 'N' and 'Unknown', the shape of the dataframe: {df5.shape}")
print(f"After excluding rows where the Fatal (Y/N) column contains values other than 'Y', 'N' and 'Unknown', the shape of the dataframe: {df6.shape}")

Before excluding rows where the Fatal (Y/N) column contains values other than 'Y', 'N' and 'Unknown', the shape of the dataframe: (1156, 12)
After excluding rows where the Fatal (Y/N) column contains values other than 'Y', 'N' and 'Unknown', the shape of the dataframe: (1155, 12)


(iii) Clean the age column

In [19]:
# check which entries in the Age column are neither plain numbers nor "Unknown"
pandas.DataFrame(df6[~ df6["Age"].str.isdigit() & ~ df6["Age"].isin(["Unknown"])]["Age"].unique(), columns=["Age"])

Unnamed: 0,Age
0,20s
1,Teen
2,60s
3,18 months
4,40s
5,30s
6,50s
7,teen
8,28 & 26


In [20]:
# clean the age column
def clean_age_data(age):
    """Normalise one raw age entry to a numeric string; "Unknown" passes through."""
    age = str(age).strip()
    # "teen" / "Teen": a teen is around 13 - 17 years old, so use the midpoint
    if age.lower() == "teen":
        return str((13 + 17) / 2)
    # Ages given in months (e.g., "18 months"): convert to years
    if age.endswith(("month", "months")):
        return str(float(age.split()[0]) / 12)
    # Two ages joined by "&" (e.g., "28 & 26"): use their mean
    if "&" in age:
        parts = [float(part) for part in age.split("&")]
        return str(sum(parts) / len(parts))
    # Decade entries (e.g., "40s", "50s"): drop the trailing "s"
    if age.endswith("s") and age[:-1].isdigit():
        return age[:-1]
    return age

df6["Age"] = df6["Age"].apply(clean_age_data)
df6.head()
df6.head()

Unnamed: 0,Date,Type,Country,Area,Location,Activity,Sex,Age,Fatal (Y/N),Time,Species,original order
0,2017-06-11,Unprovoked,AUSTRALIA,Western Australia,"Point Casuarina, Bunbury",Body boarding,M,48,N,08h30,"White shark, 4 m",6095
1,2017-06-10,Unprovoked,AUSTRALIA,Victoria,"Flinders, Mornington Penisula",Surfing,F,Unknown,N,15h45,7 gill shark,6094
2,2017-06-10,Unprovoked,USA,Florida,"Ponce Inlet, Volusia County",Surfing,M,19,N,10h00,Other,6093
3,2017-06-07,Unprovoked,UNITED KINGDOM,South Devon,Bantham Beach,Surfing,M,30,N,Unknown,"3m shark, probably a smooth hound",6092
4,2017-06-04,Unprovoked,USA,Florida,"Middle Sambo Reef off Boca Chica, Monroe County",Spearfishing,M,Unknown,N,Unknown,8' shark,6091
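
After this step the Age values are still text, which is fine for display but not for arithmetic. If numeric ages are needed later, one option (an assumption, not part of the project requirements) is to convert them with `pandas.to_numeric`, letting "Unknown" become NaN. The sample values below are made up to mirror the cleaned output.

```python
import pandas

# Made-up sample of cleaned Age strings, including the "Unknown" placeholder
ages = pandas.Series(["48", "Unknown", "1.5", "15.0", "27.0"])

# errors="coerce" turns anything non-numeric (here: "Unknown") into NaN
age_numeric = pandas.to_numeric(ages, errors="coerce")

n_unknown = int(age_numeric.isna().sum())
mean_age = age_numeric.mean()  # NaN values are skipped by default
print(age_numeric)
print(f"{n_unknown} unknown ages, mean of known ages: {mean_age}")
```

Storing the numeric version in a separate column would keep the "Unknown" labels available for reporting.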


(iv) Clean the Time column

In [21]:
# check which time entries do not follow the "HHhMM" format
pandas.DataFrame(df6[~df6["Time"].str.contains(r"(?:^\d{1,2}h\d{2}|Unknown)", regex=True)]["Time"].unique(), columns=["Time"])

Unnamed: 0,Time
0,Shortly before 12h00
1,Morning
2,Afternoon
3,After noon
4,Late afternoon
5,1300
6,Midnight
7,Evening
8,Night
9,Sometime between 06h00 & 08hoo


In [22]:
df7 = df6.copy(deep=True)
# Insert the missing "h" into purely numeric times in the "Time" column (e.g., "1200" to "12h00")
df7["Time"] = df7["Time"].str.replace(r'(\d{1,2})(\d{2})', r'\1h\2', regex=True)
# replace wrong time format like "15j45" into "15h45"
df7["Time"] = df7["Time"].str.replace(r"(\d{1,2})\w(\d{2})", r"\1h\2", regex=True)
# extract the time from long sentence such as "Shortly before 12h00" and "Before 07h00"
# (expand=False returns a Series; vague entries such as "Morning" or "Night" become null)
df7["Time"] = df7["Time"].str.extract(r"(\d{1,2}h\d{2})", expand=False)
# replace null values with "Unknown" value
df7["Time"] = df7["Time"].fillna("Unknown")
df7["Time"]

0         08h30
1         15h45
2         10h00
3       Unknown
4       Unknown
         ...   
1154    Unknown
1155    Unknown
1156      07h30
1157    Unknown
1158      14h00
Name: Time, Length: 1155, dtype: object
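
To see what each of the three regex steps does, it helps to run them on a handful of values. This is a small illustrative sketch: the sample strings are taken from the outputs and comments above, and the intermediate names `step1`, `step2` and `cleaned` are assumptions made for the example.

```python
import pandas

times = pandas.Series([
    "08h30",
    "1300",
    "15j45",
    "Shortly before 12h00",
    "Morning",
    "Sometime between 06h00 & 08hoo",
])

# Step 1: four consecutive digits get an "h" inserted ("1300" -> "13h00")
step1 = times.str.replace(r"(\d{1,2})(\d{2})", r"\1h\2", regex=True)
# Step 2: a wrong separator between the digits is replaced ("15j45" -> "15h45")
step2 = step1.str.replace(r"(\d{1,2})\w(\d{2})", r"\1h\2", regex=True)
# Step 3: keep only the first HHhMM match; entries with no match become "Unknown"
cleaned = step2.str.extract(r"(\d{1,2}h\d{2})", expand=False).fillna("Unknown")

print(pandas.DataFrame({"raw": times, "cleaned": cleaned}))
```

Note that a range such as "Sometime between 06h00 & 08hoo" keeps only its first time, and descriptive entries such as "Morning" lose their information and end up as "Unknown".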

(v) Remove the trailing comma in the "Area" column

In [23]:
df8 = df7.copy(deep=True)

In [24]:
# show the rows whose "Area" value ends with a comma
df8[df8["Area"].str.strip().str.endswith(",")]

Unnamed: 0,Date,Type,Country,Area,Location,Activity,Sex,Age,Fatal (Y/N),Time,Species,original order
9,2017-05-12,Unprovoked,UNITED ARAB EMIRATES,"Sharjah,",Khor Fakkan,Spearfishing,M,41,N,Unknown,Other,6086
1061,2008-10-06,Unprovoked,CROATIA,"Split-Dalmatia Count,","Smokvina Bay, Vis Island",Spearfishing,M,43,N,12h00,5 m white shark,5036


In [25]:
# remove the trailing comma from the "Area" column
df8["Area"] = df8["Area"].str.strip().str.replace(r",$", "", regex=True)
df8.head()

Unnamed: 0,Date,Type,Country,Area,Location,Activity,Sex,Age,Fatal (Y/N),Time,Species,original order
0,2017-06-11,Unprovoked,AUSTRALIA,Western Australia,"Point Casuarina, Bunbury",Body boarding,M,48,N,08h30,"White shark, 4 m",6095
1,2017-06-10,Unprovoked,AUSTRALIA,Victoria,"Flinders, Mornington Penisula",Surfing,F,Unknown,N,15h45,7 gill shark,6094
2,2017-06-10,Unprovoked,USA,Florida,"Ponce Inlet, Volusia County",Surfing,M,19,N,10h00,Other,6093
3,2017-06-07,Unprovoked,UNITED KINGDOM,South Devon,Bantham Beach,Surfing,M,30,N,Unknown,"3m shark, probably a smooth hound",6092
4,2017-06-04,Unprovoked,USA,Florida,"Middle Sambo Reef off Boca Chica, Monroe County",Spearfishing,M,Unknown,N,Unknown,8' shark,6091


#### Step 4: Data Cleaning Validation

In [26]:
# export the cleaned data to a CSV file so the entries can be double-checked by hand
df8.to_csv("cross_check_data.csv")

In [27]:
# check and validate Time column
pandas.DataFrame(df8[~df8["Time"].str.contains(r"(?:^\d{1,2}h\d{2}|Unknown)", regex=True)]["Time"].unique(), columns=["Time"])

Unnamed: 0,Time


In [28]:
# check and validate Age column: every entry should be numeric or "Unknown"
pandas.DataFrame(df8[pandas.to_numeric(df8["Age"], errors="coerce").isna() & ~ df8["Age"].isin(["Unknown"])]["Age"].unique(), columns=["Age"])

Unnamed: 0,Age


#### Step 5: Data Enrichment

With the dataset cleaned, it’s time to enrich the data:
- Make an address column by combining the Location, Area and Country columns (this might affect your missing value strategy!).
- Add a new column and call it “Shark”. Extract information from the Species column. If the species text mentions the word “white”, make the “Shark” column value “Great White”. If the text mentions “bull”, make the “Shark” column value “Bull”. Otherwise, if neither word is found, make the value “Other”. (Hint: make sure the species column is all lowercase.)


In [29]:
df9 = df8.copy(deep=True)

In [30]:
# Make an address column, by combining the Location, Area and Country columns together
def combine_address(location, area, country):
    # "Other" is the placeholder used above for missing Location, Area and Country values
    if location == "Other" or area == "Other":
        return "Unknown"
    elif country == "Other":
        # country is missing: join only the known parts, without a trailing separator
        return location + ", " + area
    else:
        return location + ", " + area + ", " + country

df9["Address"] = df9.apply(lambda x: combine_address(x["Location"], x["Area"], x["Country"]), axis=1)
df9

Unnamed: 0,Date,Type,Country,Area,Location,Activity,Sex,Age,Fatal (Y/N),Time,Species,original order,Address
0,2017-06-11,Unprovoked,AUSTRALIA,Western Australia,"Point Casuarina, Bunbury",Body boarding,M,48,N,08h30,"White shark, 4 m",6095,"Point Casuarina, Bunbury, Western Australia, A..."
1,2017-06-10,Unprovoked,AUSTRALIA,Victoria,"Flinders, Mornington Penisula",Surfing,F,Unknown,N,15h45,7 gill shark,6094,"Flinders, Mornington Penisula, Victoria, AUSTR..."
2,2017-06-10,Unprovoked,USA,Florida,"Ponce Inlet, Volusia County",Surfing,M,19,N,10h00,Other,6093,"Ponce Inlet, Volusia County, Florida, USA"
3,2017-06-07,Unprovoked,UNITED KINGDOM,South Devon,Bantham Beach,Surfing,M,30,N,Unknown,"3m shark, probably a smooth hound",6092,"Bantham Beach, South Devon, UNITED KINGDOM"
4,2017-06-04,Unprovoked,USA,Florida,"Middle Sambo Reef off Boca Chica, Monroe County",Spearfishing,M,Unknown,N,Unknown,8' shark,6091,"Middle Sambo Reef off Boca Chica, Monroe Count..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1154,2008-01-30,Provoked,JAPAN,Tokyo Bay,Other,Diving,Unknown,Unknown,Unknown,Unknown,"Goblin shark, 4.2'",4943,Unknown
1155,2008-01-29,Unprovoked,SOUTH AFRICA,KwaZulu-Natal,"Suncoast Pirates Beach, Durban",Surf-skiing,M,42,N,Unknown,"Blacktip shark, 2m",4942,"Suncoast Pirates Beach, Durban, KwaZulu-Natal,..."
1156,2008-01-27,Provoked,AUSTRALIA,Queensland,200 km east of Coolangatta,Accidentally stood on hooked shark's tail befo...,M,20,N,07h30,"Mako shark, 90kg",4941,"200 km east of Coolangatta, Queensland, AUSTRALIA"
1157,2008-01-19,Invalid,NEW ZEALAND,South Island,Marfells Beach,Wading,M,Unknown,N,Unknown,No shark involvement,4940,"Marfells Beach, South Island, NEW ZEALAND"


In [31]:
# Add a new column, call it “Shark”. Extract information from the Species column. If the species text mentions the word “white”, 
# make the “Shark” column value “Great White”. If the text mentions “bull”, make the “Shark” column value “Bull”. 
# Otherwise, if neither word is found, make the value “Other”. 

def categorize_species(x):
    # lowercase once so the match is case-insensitive
    species = x.lower()
    if "white" in species:
        return "Great White"
    elif "bull" in species:
        return "Bull"
    else:
        return "Other"

# store the category in a new "Shark" column and keep the original Species text
df9["Shark"] = df9["Species"].apply(categorize_species)
df9

Unnamed: 0,Date,Type,Country,Area,Location,Activity,Sex,Age,Fatal (Y/N),Time,Species,original order,Address,Shark
0,2017-06-11,Unprovoked,AUSTRALIA,Western Australia,"Point Casuarina, Bunbury",Body boarding,M,48,N,08h30,"White shark, 4 m",6095,"Point Casuarina, Bunbury, Western Australia, A...",Great White
1,2017-06-10,Unprovoked,AUSTRALIA,Victoria,"Flinders, Mornington Penisula",Surfing,F,Unknown,N,15h45,7 gill shark,6094,"Flinders, Mornington Penisula, Victoria, AUSTR...",Other
2,2017-06-10,Unprovoked,USA,Florida,"Ponce Inlet, Volusia County",Surfing,M,19,N,10h00,Other,6093,"Ponce Inlet, Volusia County, Florida, USA",Other
3,2017-06-07,Unprovoked,UNITED KINGDOM,South Devon,Bantham Beach,Surfing,M,30,N,Unknown,"3m shark, probably a smooth hound",6092,"Bantham Beach, South Devon, UNITED KINGDOM",Other
4,2017-06-04,Unprovoked,USA,Florida,"Middle Sambo Reef off Boca Chica, Monroe County",Spearfishing,M,Unknown,N,Unknown,8' shark,6091,"Middle Sambo Reef off Boca Chica, Monroe Count...",Other
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1154,2008-01-30,Provoked,JAPAN,Tokyo Bay,Other,Diving,Unknown,Unknown,Unknown,Unknown,"Goblin shark, 4.2'",4943,Unknown,Other
1155,2008-01-29,Unprovoked,SOUTH AFRICA,KwaZulu-Natal,"Suncoast Pirates Beach, Durban",Surf-skiing,M,42,N,Unknown,"Blacktip shark, 2m",4942,"Suncoast Pirates Beach, Durban, KwaZulu-Natal,...",Other
1156,2008-01-27,Provoked,AUSTRALIA,Queensland,200 km east of Coolangatta,Accidentally stood on hooked shark's tail befo...,M,20,N,07h30,"Mako shark, 90kg",4941,"200 km east of Coolangatta, Queensland, AUSTRALIA",Other
1157,2008-01-19,Invalid,NEW ZEALAND,South Island,Marfells Beach,Wading,M,Unknown,N,Unknown,No shark involvement,4940,"Marfells Beach, South Island, NEW ZEALAND",Other
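
The species categorisation can also be done without a row-wise `apply`, using vectorised string tests and `numpy.select`. The sketch below is an alternative shown for comparison, not the approach used in the cell above; the sample strings mirror values seen in the Species column, and the variable names are assumptions.

```python
import numpy
import pandas

# Sample Species strings mirroring values seen in the data
species = pandas.Series(["White shark, 4 m", "Bull shark", "7 gill shark", "Other"])

# Lowercase first so the keyword tests are case-insensitive
lower = species.str.lower()
is_white = lower.str.contains("white", regex=False).to_numpy(dtype=bool)
is_bull = lower.str.contains("bull", regex=False).to_numpy(dtype=bool)

# Conditions are checked in order, so "white" takes priority over "bull"
shark = numpy.select([is_white, is_bull], ["Great White", "Bull"], default="Other")
print(pandas.DataFrame({"Species": species, "Shark": shark}))
```

The order of the condition list plays the same role as the `if` / `elif` order in the function version.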


In [32]:
# index=False keeps the row index from being written as an extra unnamed column
df9.to_csv("final_project.csv", index=False)