# PROJECT 02 :
-----

### BRIEF : data analysis of global shark attacks for a business idea
You will initially examine the Shark Attack dataset, understanding its structure and formulating a hypothesis or several hypotheses about the data. 
 We hypothesize that shark attacks are more common in certain locations and peak during specific months.
We define a Business Case, such as 

* ‚ÄúAs a company that sells medical products, I want to identify destinations with high shark attack rates.‚Äù
* ‚ÄúAs a company providing supply transportation services, I want to know when and where shark attacks peak to plan the safe transport of medical supplies to hospitals.‚Äù

 Throughout the project, we will use Python and the pandas library to apply at least five data cleaning techniques to handle missing values, duplicates, and formatting inconsistencies. After cleaning, we will perform basic exploratory data analysis to validate our hypotheses and extract insights. 

üìù BUSINESS IDEA ‚Äî 3 Bullet Points

* Problem to Solve: Coastal hospitals and emergency response teams are not always prepared with the right medical supplies during periods of high shark-attack frequency. This business solves the problem by predicting when and where attacks are most likely, so medical supplies can be stocked in advance.

* Business Concept: Use historical shark attack data to create global heatmaps and seasonal risk forecasts. Then provide pharmaceutical products (painkillers, antibiotics, blood bags, emergency kits) and transportation support to hospitals and ambulances near high-risk beaches.

* Data Used to Profit: The business will analyze Country, Date/Month, Gender, Age, Fatality, and Type of Injury to identify high-risk locations, peak attack months, and most common injury types. This allows optimized supply production, targeted sales, and efficient delivery to the areas that need it most.



## üåÄCOLUMNS TO CLEAN : 
-----

**- COUNTRY (global comparison between countries to invest in more)**@Blanca

    * we just take the country column and make sure every country name is accurately named
    * We are gonna check the column of COUNTRY and make sure every country name is correct
    * We remove NULL COUNTRY data rows that dont have any country
    * We need to make sure that the name of COUNTRY  is capitalized and written the same way for example :
        United States of Amercia == USA == US
        it has to have the same name and consistent!

**- DATE ( MONTH + YEAR )**@Blanca

    * split the date into day - month columns and only use month column
    * interpret months which have shark attacks happen the most

**- GENDER ( F or M )** @Cecilia

    * We check unique values and make it so it is only two values F or M and deleted all rows that have other values
    * we noticed mostly M get attacked
    * percentages Male to Female victims
    * We remove NULL DATE data rows that dont have a date or that the date doesnt include a month and a year
    * We are gonna check the column of DATE and seperate it into three columns DAY + MONTH + YEAR
    * We verify that the new YEAR column matches with the old YEAR column and keep the ones that match
        * NEW YEAR COLUMN is the one split from the DATE column
        * OLD YEAR COLUMN is the one already existing in the original sheet
    * Once we finish comparing the new YEAR column vs old YEAR column and we find them not matching on some data rows. we remove the none matching ones so we keep clean data of accurate years
    * We remove the DAY and OLD YEAR columns
    * We are gonna keep the MONTH and matching YEAR

**- AGE (victims age ranges)** 

    * majority of victims survive
    * we split the age groups into three categories (minors under 18 / adults 18-40 and 40+) 
    * keep in mind complications depending on age when getting treated
    * percentages of victims based on age ranges
    * we split the age groups into three categories (minors under 18 / adults 18-40 and elders 40+) 
    * note that there are complications depending on age

**- FATALITY ( Y or N )** @Cecilia

    * mostly survived for the pharamaceutical logistics & transportation of injured people to the hospital
    * assumption we have a percentage of survivals highest and we use it to sell for the
    * We check unique values and make sure it is only two values Y or N and deleted all rows that have other values
    * assumption we have a high percentage of survivals and we use it to sell the idea to profit from selling products to hospitals

**- INJURY TYPE**

    * clean the type of injury by severity
    * seperate the injury type into different severity
    * seperate the injury type into different body parts
    * treatment depends on type of injury and thus the supplies as well



## CLEAN THE DATA
-------

In [45]:
#import data

import pandas as pd

data_shark = pd.read_csv("/Users/blanca/Library/Mobile Documents/com~apple~CloudDocs/IRONHACK/github/02_project-shark-attacks/data/raw.csv")

data_shark.columns =[col.upper() for col in data_shark.columns]

In [47]:
# COUNTRY

data_shark = data_shark.dropna(subset=["COUNTRY"])
data_shark["COUNTRY"] = data_shark["COUNTRY"].str.strip().str.title()

country_replacements = {
    "United States Of America": "USA",
    "Usa": "USA",
    "Us": "USA",
    "U.s.": "USA",
    "United Kingdom": "UK",
    "England": "UK",
    "Uk":"UK",
    "Brasil": "Brazil",
    "M√©xico": "Mexico",
}

data_shark["COUNTRY"] = data_shark["COUNTRY"].replace(country_replacements)
data_shark = data_shark[data_shark["COUNTRY"] != "Italy / Croatia"]

print(data_shark["COUNTRY"].unique())

['Australia' 'Bahamas' 'Seychelles' 'Argentina' 'Costa Rica' 'Brazil'
 'Egypt' 'French Polynesia' 'USA' 'Colombia' 'South Africa' 'Spain'
 'Comoros' 'Mexico' 'Cape Verde' 'Cayman Islands' 'Italy' 'Jamaica'
 'Trinidad & Tobago' 'Canada' 'Croatia' 'Saudi Arabia' 'Antigua'
 'United Arab Emirates (Uae)' 'Guam' 'Nevis' 'Japan' 'New Zealand'
 'British Virgin Islands' 'Senegal' 'Belize' 'Liberia' 'Honduras'
 'Sri Lanka' 'Indonesia' 'New Caledonia' 'Madagascar' 'Malaysia' 'Tonga'
 'Bermuda' 'Montenegro' 'Somalia' 'Greece' 'Mozambique' 'Papua New Guinea'
 'Tanzania' 'Panama' 'Philippines' 'Atlantic Ocean' 'Johnston Island'
 'Marshall Islands' 'Caribbean Sea' 'Turkey' 'Cuba' 'Guatemala'
 'North Atlantic Ocean' 'North Pacific Ocean' 'Pacific Ocean'
 'Indian Ocean' 'UK' 'Israel' 'India' 'Haiti' 'Yemen' 'Crete' 'France'
 'Syria' 'Azores' 'Fiji' 'Guyana' 'China' 'Norway' 'Iceland' 'Roatan'
 'Guinea']


In [56]:
# DATE 

#TO HAVE THE COMPLEATE DATE.
data_shark["Full_Date"] = data_shark["DATE"].astype(str) + " " + data_shark["YEAR"].astype(str)

#WE ELIMINATE SUFIX AS 1ST , 2ND ....
data_shark["Full_Date"] = (
    data_shark["Full_Date"]
    .str.replace("[-_/]", " ", regex=False)
    .str.replace("st", "", regex=False)
    .str.replace("nd", "", regex=False)
    .str.replace("rd", "", regex=False)
    .str.replace("th", "", regex=False)
    .str.strip()
)   

#WE ASK PANDA TO TRY TO READ THE DATE 
data_shark["Full_Date"] = pd.to_datetime(
    data_shark["Full_Date"], 
    format="%d %b %Y",   
    errors="coerce", 
    dayfirst=True
)

#WE CREATE OUR COLUMNS
data_shark["DAY"] = data_shark["Full_Date"].dt.day
data_shark["MONTH"] = data_shark["Full_Date"].dt.month_name()  # Example: October, February
data_shark["YEAR"] = data_shark["Full_Date"].dt.year

print(data_shark[["DATE", "DAY", "MONTH", "YEAR"]].head(15))
data_shark.head(10)


           DATE  DAY MONTH  YEAR
0   25 Aug 2023  NaN   NaN   NaN
1   21 Aug-2023  NaN   NaN   NaN
2   07-Jun-2023  NaN   NaN   NaN
3   02-Mar-2023  NaN   NaN   NaN
4   18-Feb-2023  NaN   NaN   NaN
5   08-Feb-2022  NaN   NaN   NaN
6   15-Nov-2021  NaN   NaN   NaN
7   16-Oct-2021  NaN   NaN   NaN
8   10-Sep-2021  NaN   NaN   NaN
9   29-Jul-2020  NaN   NaN   NaN
10  06-Jun-2020  NaN   NaN   NaN
11  20-Dec-2019  NaN   NaN   NaN
12  21-Oct-2019  NaN   NaN   NaN
13  28-Jul-2019  NaN   NaN   NaN
14  18-Jul-2019  NaN   NaN   NaN


Unnamed: 0,DATE,YEAR,TYPE,COUNTRY,STATE,LOCATION,ACTIVITY,NAME,SEX,AGE,INJURY,FATAL Y/N,TIME,SPECIES,SOURCE,Full_Date,DAY,MONTH
0,25 Aug 2023,,Unprovoked,Australia,New South Wales,"Lighthouse Beach, Port Macquarie",Surfing,Toby Begg,M,44,Severe injuries to lower limbs,,10h00,"White shark, 3.8-4.2m","B. Myatt, & M. Michaelson, GSAF",NaT,,
1,21 Aug-2023,,Questionable,Bahamas,New Providence Isoad,"Saunders Beach, Nassau",,male,M,20/30,Body found with shark bites. Possible drowning...,,Morning,,"The Tribune, 8/21/2023",NaT,,
2,07-Jun-2023,,Unprovoked,Bahamas,Freeport,Shark Junction,Scuba diving,Heidi Ernst,F,73,Calf severely bitten,,13h00,Caribbean rreef shark,"J. Marchand, GSAF",NaT,,
3,02-Mar-2023,,Unprovoked,Seychelles,Praslin Island,,Snorkeling,Arthur ‚Ä¶,M,6,Left foot bitten,,Afternoon,Lemon shark,"Midlibre, 3/18/2023",NaT,,
4,18-Feb-2023,,Questionable,Argentina,Patagonia,Chubut Province,,Diego Barr√≠a,M,32,Death by misadventure,,,,"El Pais, 2/27/2023",NaT,,
5,08-Feb-2022,,,Costa Rica,Guanacoste,Playa Del Coco,Diving,female,F,50,Right forearm and left hand injured,,,"Bull shark, 3m",Diario Extra Del Costa Rica 2/9/2022,NaT,,
6,15-Nov-2021,,Unprovoked,Brazil,S√£o Paulo.,Boqueir√£o Beach,Playing,male,M,11,Minor cuts to left thigh,,12h00,dogfish,"K. McMurray, TrackingSharks.com",NaT,,
7,16-Oct-2021,,,Australia,Queensland,Sudbury Island,Spearfishing,Torrance Sambo,M,26,Disappeared,,,,"K. McMurray, TrackingSharks.com",NaT,,
8,10-Sep-2021,,,Egypt,,Sidi Abdel Rahmen,Swimming,Mohamed,M,,Laceration to arm caused by metal object,,,No shark invovlement,Dr. M. Fouda & M. Salrm,NaT,,
9,29-Jul-2020,,Watercraft,Australia,Tasmania,Tenth Island,Sightseeing,5.5 m runabout. Occupants: Sean & James Vinar,,,"No injury to occupants, injury to shark attemp...",,09h08,"White shark, 4m","C. Black, GSAF",NaT,,
