# Data Analyst Associate Practical Exam Submission




## Task 1



**Data Validation**

### Dataset
The dataset used for this analysis is `food_claims_2212.csv`.

| Column Name | Criteria |
|-------------|----------|
| claim_id | Nominal. Unique identifier of the claim. No missing values, same as the description. No cleaning is needed. |
| time_to_close | Discrete. Number of days to close the claim. 256 unique values ranging from 76 to 518, with no missing values, same as the description. No cleaning is needed. |
| claim_amount | Continuous. Initial claim requested, in the currency of Brazil, rounded to 2 decimal places. No missing values, but stored as text with an 'R$' prefix. Cleaning is needed. |
| amount_paid | Continuous. Final amount paid, in the currency of Brazil, rounded to 2 decimal places. 36 missing values. Cleaning is needed. |
| location | Nominal. Location of the claim, one of “RECIFE”, “SAO LUIS”, “FORTALEZA”, or “NATAL”. Same as the description. No cleaning is needed. |
| individuals_on_claims | Discrete. Number of individuals on this claim. Minimum 1 person, same as the description. No cleaning is needed. |
| linked_cases | Nominal. Whether this claim is linked to other cases. Either TRUE or FALSE. 26 missing values. Cleaning is needed. |
| cause | Nominal. Cause of the food poisoning. One of “vegetable”, “meat” or “unknown”. Inconsistent spellings are present. Cleaning is needed. |


### Report
1. The shape of the dataset was checked.
2. An overview of the dataset was obtained.
3. The datatype of each column was obtained.
4. The number of unique values for each feature was assessed.
5. Missing values in the dataset were checked.
6. The summary statistics were obtained.
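
The six checks above can be sketched in pandas as follows. This is a minimal illustration, not the exact code used for the report: the three-row sample and its values are invented so the snippet runs without the real file, and the variable names are assumptions.

```python
import io

import pandas as pd

# Hypothetical three-row sample laid out like food_claims_2212.csv.
# The values are invented for illustration only.
SAMPLE_CSV = """claim_id,time_to_close,claim_amount,amount_paid,location,individuals_on_claims,linked_cases,cause
1,317,R$ 74474.55,51231.37,RECIFE,15,False,unknown
2,195,R$ 52137.83,,FORTALEZA,12,True,meat
3,183,R$ 24447.20,23986.30,SAO LUIS,10,,VEGETABLES
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

shape = df.shape                  # 1. (rows, columns)
overview = df.head()              # 2. first rows of the dataset
dtypes = df.dtypes                # 3. datatype of each column
unique_counts = df.nunique()      # 4. distinct non-missing values per column
missing_counts = df.isna().sum()  # 5. missing values per column
summary = df.describe()           # 6. summary statistics of numeric columns

print(shape)
print(missing_counts)
```

With the real file, `pd.read_csv("food_claims_2212.csv")` would replace the in-memory sample.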


### Observation
1. The dataset consists of 2000 rows and 8 columns.
2. The dataset contains 1 float64 column, 3 int64 columns and 4 object columns.
3. The unique values in the 'linked_cases' and 'cause' fields were assessed.
    - The 'linked_cases' column contained missing values, and its unique values were in sentence case (True, False).
    - The 'cause' column had 5 unique values ('unknown', 'meat', 'vegetable', ' Meat', 'VEGETABLES') instead of the 3 values ('vegetable', 'meat', or 'unknown') listed in the data dictionary. This could be due to inconsistent data entry.
4. Missing values accounted for 1.8% of the 'amount_paid' field and 1.3% of the 'linked_cases' field.
5. The 'claim_amount' field was stored as an object instead of a floating-point number because of the Brazilian currency symbol ('R$').
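
The missing-value percentages follow directly from the counts in the table (36 and 26 out of 2000 rows). A minimal sketch, assuming a placeholder frame that only reproduces those counts and not the real values:

```python
import numpy as np
import pandas as pd

# Placeholder frame with the reported size and missing counts:
# 2000 rows, 36 missing in amount_paid, 26 missing in linked_cases.
df = pd.DataFrame({
    "amount_paid": [1000.0] * 1964 + [np.nan] * 36,
    "linked_cases": [False] * 1974 + [None] * 26,
})

# isna() yields a boolean mask; its column mean is the fraction missing.
pct_missing = df.isna().mean() * 100

print(pct_missing.round(1))
```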


### Data Cleaning
1. The non-numeric characters ('R$') in 'claim_amount' were removed so that the feature could be converted to float.
2. Missing values in the 'amount_paid' field were replaced with the overall median amount paid, as required.
3. There were 4 unique locations, as provided in the data dictionary, so no cleaning was required.
4. Missing values in the 'linked_cases' field were replaced with 'FALSE', as required, and the values in sentence case were converted to uppercase.
5. Entries with the cause formatted as ' Meat' or 'VEGETABLES' were replaced with 'meat' or 'vegetable' respectively.

After the data validation, the dataset contains 2000 rows and 8 columns without missing values.
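
The cleaning steps above could look roughly like this in pandas. It is a sketch under assumptions: the four rows are invented to exhibit each problem, and 'linked_cases' is assumed to hold Python booleans before cleaning, which may differ from how the real file is parsed.

```python
import numpy as np
import pandas as pd

# Hypothetical rows showing each issue described above; values are invented.
df = pd.DataFrame({
    "claim_amount": ["R$ 74474.55", "R$ 52137.83", "R$ 24447.20", "R$ 29006.28"],
    "amount_paid": [51231.37, np.nan, 23986.30, 27942.72],
    "linked_cases": [False, True, None, False],
    "cause": ["unknown", " Meat", "VEGETABLES", "meat"],
})

# Step 1: strip the currency prefix, then convert the text to float.
df["claim_amount"] = (
    df["claim_amount"].str.replace("R$", "", regex=False).str.strip().astype(float)
)

# Step 2: replace missing amounts with the overall median.
df["amount_paid"] = df["amount_paid"].fillna(df["amount_paid"].median())

# Step 4: map booleans to uppercase labels; anything unmapped becomes 'FALSE'.
df["linked_cases"] = (
    df["linked_cases"].map({True: "TRUE", False: "FALSE"}).fillna("FALSE")
)

# Step 5: normalise whitespace and case, then fix the plural spelling.
df["cause"] = (
    df["cause"].str.strip().str.lower().replace({"vegetables": "vegetable"})
)

print(df)
print(df.isna().sum().sum())  # total missing values remaining
```

Step 3 needs no code because the location values already match the data dictionary.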

## Task 2

**Number of claims in each location**

The 'RECIFE' category has the most observations, with 885 claims.

To determine whether the observations are balanced across the 'location' variable, the distribution of claims across locations was examined.
The visualization below shows that the number of claims differs between locations. Therefore the observations are not balanced.

![task 2](task%202.png)
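
The counts behind a chart like this can be obtained with `value_counts`. A minimal sketch, assuming a short invented 'location' column in place of the real 2000 entries:

```python
import pandas as pd

# Hypothetical location column; the real one has 2000 entries.
location = pd.Series(
    ["RECIFE", "RECIFE", "RECIFE", "SAO LUIS", "SAO LUIS", "FORTALEZA", "NATAL"]
)

counts = location.value_counts()                      # claims per location
shares = location.value_counts(normalize=True) * 100  # same, as percentages
top_location = counts.idxmax()                        # location with most claims

print(counts)
print(top_location)
```

Comparing the percentage shares is one simple way to judge balance: roughly equal shares would indicate balanced observations.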

## Task 3

For the majority of claims, the time to close was in the range of 90 to 270 days.

![task 3](task%203.png)
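
A statement about a range like this can be checked numerically as well as visually. The sketch below assumes ten invented 'time_to_close' values and computes the share that falls between 90 and 270 days:

```python
import pandas as pd

# Hypothetical time_to_close values in days; not taken from the real data.
time_to_close = pd.Series([76, 120, 150, 183, 195, 210, 240, 265, 317, 518])

# between() is inclusive on both ends by default.
in_range = time_to_close.between(90, 270)
share_in_range = in_range.mean() * 100  # percentage of claims in the range

print(share_in_range)
```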


## Task 4

The boxplot below shows the time to close claims by location. SAO LUIS has more outliers than the other locations.

![task 4](task%204.png)
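
Outliers can also be counted per location rather than read off the plot. The sketch below assumes the common 1.5 × IQR whisker rule, which the report does not state explicitly, and uses invented values for two locations; `count_outliers` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical data: one location without outliers, one with a single outlier.
df = pd.DataFrame({
    "location": ["RECIFE"] * 5 + ["SAO LUIS"] * 6,
    "time_to_close": [170, 175, 180, 185, 190, 170, 175, 180, 185, 190, 400],
})


def count_outliers(values: pd.Series) -> int:
    """Count points beyond 1.5 * IQR from the first and third quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return int(outside.sum())


outliers_by_location = df.groupby("location")["time_to_close"].apply(count_outliers)

print(outliers_by_location)
```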


## Further findings

**Percentage of Linked Cases**

The graph illustrates that 75.95% of the claims are unassociated with any other cases, whereas 24.05% are connected to other cases.

![task 5](task%205.png)
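
The percentages can be reproduced with a normalised `value_counts`. In this sketch the counts 1519 and 481 are not quoted in the report; they are inferred from the stated percentages and the 2000 rows.

```python
import pandas as pd

# Counts inferred from 75.95% and 24.05% of 2000 rows (an assumption).
linked_cases = pd.Series(["FALSE"] * 1519 + ["TRUE"] * 481)

# normalize=True returns fractions; multiply by 100 for percentages.
pct_linked = linked_cases.value_counts(normalize=True) * 100

print(pct_linked.round(2))
```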

**Cause of Food Poisoning**

The category 'meat' is notable for having the highest occurrence count, totaling 957 instances, followed by 'unknown', and then 'vegetable'. The following chart displays the distribution of food poisoning cases across different locations.

![task 6](task%206.png)



![task 6_1](task%206_1.png)
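
The tables behind charts like these can be built with `value_counts` and `pd.crosstab`. A minimal sketch on six invented rows, so the numbers do not match the real counts:

```python
import pandas as pd

# Hypothetical cleaned rows; values are invented for illustration.
df = pd.DataFrame({
    "location": ["RECIFE", "RECIFE", "SAO LUIS", "FORTALEZA", "NATAL", "RECIFE"],
    "cause": ["meat", "unknown", "meat", "vegetable", "meat", "meat"],
})

cause_counts = df["cause"].value_counts()  # claims per cause
# Rows are locations, columns are causes, cells are claim counts.
cause_by_location = pd.crosstab(df["location"], df["cause"])

print(cause_counts)
print(cause_by_location)
```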


**Unexplained Discrepancies in Food Poisoning Claims: Insights from FORTALEZA and SAO LUIS**

In FORTALEZA, claim ID 1015, involving 2 individuals, took 156 days to close. Despite an initial claim amount of `R$ 7301.46`, the final amount paid was `R$ 20105.7`, an unexplained increase of `R$ 12804.24`. The cause is recorded as unknown, and no linked cases were identified.


Claim 584 from SAO LUIS took 237 days to close. The initial claim amount was `R$ 64823.83`, but only `R$ 20105.7` was paid, a difference of `R$ 44718.13` between the claimed and paid amounts. The claim involved 14 individuals, yet no linked cases were found, and the cause is recorded as unknown.

Further clarification is required to better understand the 'unknown' category and reveal potential insights.

![task 7](task%207.png)
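
The two differences quoted for claims 1015 and 584 can be re-derived from the claimed and paid amounts. The sketch below uses only the figures given in the text; the 'difference' column name is an assumption.

```python
import pandas as pd

# Amounts for the two claims, as quoted in the text above.
claims = pd.DataFrame({
    "claim_id": [1015, 584],
    "location": ["FORTALEZA", "SAO LUIS"],
    "claim_amount": [7301.46, 64823.83],
    "amount_paid": [20105.70, 20105.70],
})

# Positive means more was paid than claimed; negative means less was paid.
claims["difference"] = (claims["amount_paid"] - claims["claim_amount"]).round(2)

print(claims[["claim_id", "difference"]])
```

Sorting the full dataset by the absolute value of such a column is one way to surface other claims with large gaps.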

