### <p style="text-align: right;"> &#9999; Krrish Kishore Kumar</p>


# 21B - Understanding the raw data

<img src="https://assets.datamation.com/uploads/2023/12/dm_20231214-raw-data.png" width=300px>

As we work through the data, think about how to glean the facts from the data. Remember too what you learned over the last week about quality presentations. Consider how you can improve your visuals as you code and plot the data.

The data set you will be analyzing is from the 'Flint Water Study' and can be reviewed [here](http://flintwaterstudy.org/2015/12/complete-dataset-lead-results-in-tap-water-for-271-flint-samples/). This is a dataset of nearly 300 tests run by volunteers at Virginia Tech on water samples obtained from Flint residents. For more information on the research, look [here](http://flintwaterstudy.org/about-page/about-us/).

As you work through the data, keep in mind the U.S. Environmental Protection Agency (EPA) guidelines about lead contaminants, which state:

> Lead and copper are regulated by a treatment technique that requires systems to control the corrosiveness of their water. If more than 10% of tap water samples exceed the action level, water systems must take additional steps. For copper, the action level is 1.3 mg/L, and for lead is 0.015 mg/L. 
>
> Source: (http://www.epa.gov/your-drinking-water/table-regulated-drinking-water-contaminants#seven). 

**Make sure that you understand what the EPA guidelines mean. Your work throughout this assignment will be to analyze whether or not those guidelines have been met.**
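
One way to make the guideline concrete: the dataset reports lead in parts per billion (ppb), while the EPA states the action level in mg/L. For dilute solutions in water, 1 mg/L is about 1000 ppb, so 0.015 mg/L corresponds to 15 ppb. The sketch below is only an illustration of the "more than 10% of samples" rule. The readings are made up, and the helper name `fraction_over_action_level` is an assumption introduced for this example.

```python
import numpy as np

# EPA lead action level, converted from mg/L to ppb (1 mg/L is about 1000 ppb in water)
LEAD_ACTION_LEVEL_MG_PER_L = 0.015
LEAD_ACTION_LEVEL_PPB = LEAD_ACTION_LEVEL_MG_PER_L * 1000


def fraction_over_action_level(readings_ppb, action_level_ppb=LEAD_ACTION_LEVEL_PPB):
    """Return the fraction of readings strictly above the action level."""
    readings = np.asarray(readings_ppb, dtype=float)
    return float(np.mean(readings > action_level_ppb))


# Hypothetical readings in ppb, for illustration only (not taken from the Flint data)
example_readings = [0.5, 3.2, 16.0, 7.9, 40.1, 1.1, 2.4, 0.9, 5.0, 12.3]

fraction = fraction_over_action_level(example_readings)
guideline_exceeded = fraction > 0.10  # "more than 10% of tap water samples"
print(f"{fraction:.0%} of samples exceed {LEAD_ACTION_LEVEL_PPB:.0f} ppb")
```

The same helper could later be applied to one of the lead columns of the real dataset once it is loaded.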

----
The first step, **after** you load the commonly used modules, is to load the data.

In [1]:
# Load the common modules here

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
# Loading the data (review another way together)
# The delimiter value was cut off in this cell; a comma is assumed from the .csv extension
df_table = pd.read_table('/workspaces/codespaces-jupyter/data/flint.csv', delimiter=',')
df_table.head()

| | SampleID | Zip Code | Ward | Pb Bottle 1 (ppb) - First Draw | Pb Bottle 2 (ppb) - 45 secs flushing | Pb Bottle 3 (ppb) - 2 mins flushing | Notes |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 48504 | 6 | 0.344 | 0.226 | 0.145 | |
| 1 | 2 | 48507 | 9 | 8.133 | 10.77 | 2.761 | |
| 2 | 4 | 48504 | 1 | 1.111 | 0.11 | 0.123 | |
| 3 | 5 | 48507 | 8 | 8.007 | 7.446 | 3.384 | |
| 4 | 6 | 48505 | 3 | 1.951 | 0.048 | 0.035 | |
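
As a sketch of another way to load the same kind of file: `pd.read_csv` uses a comma as its default separator, so no delimiter argument is needed. The example below assumes the file is comma-separated and, to stay self-contained, reads a few rows copied from the output above out of an in-memory buffer instead of the real file. The names `sample_csv` and `df_sample` are introduced here for illustration.

```python
import io

import pandas as pd

# A few rows copied from the output above, standing in for the real file
sample_csv = io.StringIO(
    "SampleID,Zip Code,Ward,Pb Bottle 1 (ppb) - First Draw,"
    "Pb Bottle 2 (ppb) - 45 secs flushing,Pb Bottle 3 (ppb) - 2 mins flushing,Notes\n"
    "1,48504,6,0.344,0.226,0.145,\n"
    "2,48507,9,8.133,10.77,2.761,\n"
    "4,48504,1,1.111,0.11,0.123,\n"
)

# read_csv defaults to a comma separator, so no delimiter argument is needed
df_sample = pd.read_csv(sample_csv)
print(df_sample.shape)
print(df_sample.columns.tolist())
```

With the real file, the buffer would be replaced by the file path passed to `pd.read_csv`.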


The titles of the test columns are too long for the dataset to be read easily. So, change the column titles to: 'Pb Test1 ppb', 'Pb Test2 ppb', and 'Pb Test3 ppb'. We will use the 'rename' method to rename the columns of a dataset in Pandas.
Once you have done so, print the **first 5 rows and the last 5 rows** to ensure it worked properly.

In [None]:
# Renaming the test columns

df_table.rename(columns={
    'Pb Bottle 1 (ppb) - First Draw': 'Pb Test1 ppb',
    'Pb Bottle 2 (ppb) - 45 secs flushing': 'Pb Test2 ppb',
    'Pb Bottle 3 (ppb) - 2 mins flushing': 'Pb Test3 ppb'
}, inplace=True)


# Print the first 5 rows

df_table.head()


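| | SampleID | Zip Code | Ward | Pb Test1 ppb | Pb Test2 ppb | Pb Test3 ppb | Notes |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 48504 | 6 | 0.344 | 0.226 | 0.145 | |
| 1 | 2 | 48507 | 9 | 8.133 | 10.77 | 2.761 | |
| 2 | 4 | 48504 | 1 | 1.111 | 0.11 | 0.123 | |
| 3 | 5 | 48507 | 8 | 8.007 | 7.446 | 3.384 | |
| 4 | 6 | 48505 | 3 | 1.951 | 0.048 | 0.035 | |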


In [None]:
# Print the last 5 rows

df_table.tail()
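
Beyond eyeballing the first and last rows, the rename can also be checked programmatically. The sketch below is one possible approach, not the required one: it calls `rename` without `inplace=True`, so a new frame is returned, and then confirms the new names exist. By default `rename` silently ignores keys that match no column, so a typo in an old name would otherwise go unnoticed. The stand-in frame `df_demo` and the names `new_names` and `missing` are assumptions made for this example.

```python
import pandas as pd

# Small stand-in frame with the original column names (values from the rows shown above)
df_demo = pd.DataFrame({
    "Pb Bottle 1 (ppb) - First Draw": [0.344, 8.133],
    "Pb Bottle 2 (ppb) - 45 secs flushing": [0.226, 10.77],
    "Pb Bottle 3 (ppb) - 2 mins flushing": [0.145, 2.761],
})

new_names = {
    "Pb Bottle 1 (ppb) - First Draw": "Pb Test1 ppb",
    "Pb Bottle 2 (ppb) - 45 secs flushing": "Pb Test2 ppb",
    "Pb Bottle 3 (ppb) - 2 mins flushing": "Pb Test3 ppb",
}

# rename returns a new DataFrame and leaves df_demo unchanged
df_renamed = df_demo.rename(columns=new_names)

# Confirm that every new name is present in the renamed frame
missing = [name for name in new_names.values() if name not in df_renamed.columns]
print("Missing columns:", missing)
```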

----
### Understanding the structure and details of the data

Determine the shape and size of the dataframe. Also, confirm the data types in the 'flint_tests' dataset (loaded here as `df_table`).

In [19]:
# We can use pandas to help us see what the data looks like

df_table.info()

rows, columns = df_table.shape
print("\n")
print(f"Number of rows: {rows}")
print(f"Number of columns: {columns}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271 entries, 0 to 270
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   SampleID      271 non-null    int64  
 1   Zip Code      271 non-null    int64  
 2   Ward          271 non-null    int64  
 3   Pb Test1 ppb  271 non-null    float64
 4   Pb Test2 ppb  271 non-null    float64
 5   Pb Test3 ppb  271 non-null    float64
 6   Notes         4 non-null      object 
dtypes: float64(3), int64(3), object(1)
memory usage: 14.9+ KB


Number of rows: 271
Number of columns: 7


### Questions you should ask yourself about the data you loaded in

Before you begin to code, make sure you take time to understand the data and think about the following questions.

**These are just examples of the questions that you should *always* be thinking about when you look at data. This is important for every data scientist.**

- What is the structure of the data I loaded?
  - How many columns and rows of data are there?
  - Do I understand what each column and row represents? 
  - Did the data load the way I expected?
  - Are there many gaps in the data or missing values?
- What *kind* of data is in each column, and does it look the way that *kind* of data should?
  - Did the dates load as dates or integers?
  - Which columns are integers, floats, or strings?
  - Are there columns that do not contain data or data that is not useful?
- What things do I **_not_** understand? (This is **important!**)
  - Who can I ask or where can I get clarification?
  - Does anything in the dataset look wrong? Can it be fixed?
  - Compared with the original file, is any data missing after it was read in?
  - Do some of the values seem out of the ordinary? Could these be typos, outliers, or something you should flag for follow-up?
  - How do I work around irregular data and still have a valid analysis? If this isn't possible, what are the next steps to fix the data?
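
Several of the questions above can be answered directly with pandas. The following is a minimal sketch under stated assumptions: `df_check` is a small stand-in frame whose numeric values come from the first rows shown earlier, while the Notes text is hypothetical. The same calls could be run on the loaded dataframe.

```python
import numpy as np
import pandas as pd

# Stand-in frame; numeric values are from the first rows shown earlier,
# the Notes text is hypothetical
df_check = pd.DataFrame({
    "SampleID": [1, 2, 4],
    "Ward": [6, 9, 1],
    "Pb Test1 ppb": [0.344, 8.133, 1.111],
    "Notes": [np.nan, "hypothetical note", np.nan],
})

n_rows, n_cols = df_check.shape                           # how many rows and columns?
column_types = df_check.dtypes                            # integers, floats, or strings?
missing_per_column = df_check.isna().sum()                # where are the gaps?
duplicate_ids = df_check["SampleID"].duplicated().sum()   # does any sample appear twice?

print(n_rows, n_cols)
print(column_types)
print(missing_per_column)
print("Duplicate SampleIDs:", duplicate_ids)
```

These checks do not replace reading the data source's documentation, but they quickly flag columns that are mostly empty or that loaded with an unexpected type.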

### Overview of the statistics: describe

Use the `describe` method to review some of the basic statistics of the dataframe.


In [20]:
# Summary statistics for the numeric columns

df_table.describe()


| | SampleID | Zip Code | Ward | Pb Test1 ppb | Pb Test2 ppb | Pb Test3 ppb |
|---|---|---|---|---|---|---|
| count | 271.0 | 271.0 | 271.0 | 271.0 | 271.0 | 271.0 |
| mean | 150.856089 | 48505.103321 | 5.313653 | 10.645993 | 10.301144 | 3.660705 |
| std | 86.30895 | 3.114546 | 2.668291 | 21.560778 | 67.531251 | 10.5385 |
| min | 1.0 | 48502.0 | 0.0 | 0.344 | 0.032 | 0.031 |
| 25% | 77.5 | 48503.0 | 3.0 | 1.578 | 0.46 | 0.306 |
| 50% | 149.0 | 48505.0 | 6.0 | 3.521 | 1.4 | 0.831 |
| 75% | 224.5 | 48506.0 | 8.0 | 9.05 | 4.8065 | 2.7405 |
| max | 300.0 | 48532.0 | 9.0 | 158.0 | 1051.0 | 94.52 |


Look back at the information above.
- What kind of information is `describe()` giving you?
- Does it make sense for all the types of data in the 'flint_tests' dataset?
- Which information, if any, does the `describe()` method provide that isn't useful?

<font size=+2> &#9999;</font> `describe()` gives me the basic statistics of the numeric columns in the data frame. It does not make sense for all the types of data in the flint tests dataset: the Notes column is left out entirely because it holds no numerical data to summarize. The statistics are also not useful for the SampleID, Zip Code, and Ward columns, since those are not measurements but labels that identify each data point (its ID, zip code, and ward).
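
Following that reasoning, one option is to restrict `describe()` to the measurement columns and summarize the label columns by counting instead. The sketch below assumes a stand-in frame `df_demo` built from the five rows shown earlier, and the names `pb_columns`, `pb_summary`, and `samples_per_ward` are introduced for illustration. Requesting the 90th percentile is a choice made here because it roughly lines up with the EPA "10% of samples" rule: if that percentile is above the action level, about 10% or more of the samples exceed it.

```python
import pandas as pd

# Stand-in frame built from the five rows shown earlier in this notebook
df_demo = pd.DataFrame({
    "SampleID": [1, 2, 4, 5, 6],
    "Zip Code": [48504, 48507, 48504, 48507, 48505],
    "Ward": [6, 9, 1, 8, 3],
    "Pb Test1 ppb": [0.344, 8.133, 1.111, 8.007, 1.951],
    "Pb Test2 ppb": [0.226, 10.77, 0.11, 7.446, 0.048],
    "Pb Test3 ppb": [0.145, 2.761, 0.123, 3.384, 0.035],
})

pb_columns = ["Pb Test1 ppb", "Pb Test2 ppb", "Pb Test3 ppb"]

# Summarize only the measurements, adding the 90th percentile to the output
pb_summary = df_demo[pb_columns].describe(percentiles=[0.5, 0.9])

# Label-like columns are better summarized by counting how often each value occurs
samples_per_ward = df_demo["Ward"].value_counts().sort_index()

print(pb_summary)
print(samples_per_ward)
```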

#### Go back to the dataset and fill in descriptions that explain what each column's data represents.

<font size=+2> &#9999;</font>
- *SampleID*: The ID number of each entry (sample) in the dataset
- *Zip Code*: The zip code where the sample came from
- *Ward*: The division of Flint where the sample came from
- *Pb Test1 ppb*: The amount of lead (in parts per billion) in the first bottle, drawn before any flushing (first draw)
- *Pb Test2 ppb*: The amount of lead (in parts per billion) in the second bottle, after 45 seconds of flushing
- *Pb Test3 ppb*: The amount of lead (in parts per billion) in the third bottle, after 2 minutes of flushing



In the space below, type in your key learnings from this activity.

- I learned how to load and inspect data using pandas.
- I understood the importance of renaming columns for better readability.
- I gained insights into the structure and types of data in the dataset.
- I learned to use the describe method to get basic statistics of the dataset.
- I recognized the importance of understanding the context and details of each column in the dataset.