# Data Analytics and Visualization (part 2)

## Data Cleaning and Preparation

Data cleaning involves identifying and correcting errors, inconsistencies, and inaccuracies in a dataset. Handling missing values, outliers, and redundant information improves data quality, which leads to more accurate and reliable insights.

Data cleaning and preparation also make analysis more efficient, because a standardized dataset is easier to work with, and they reduce the risk of making faulty decisions based on flawed data.

### Handling Missing Values 

Missing values in a dataset can hinder analysis and modeling. Pandas provides functions to handle missing values, such as **fillna()**, which fills ***NaN*** values with a specific value (for example, a constant or a column mean) or by a fill method (for example, forward fill).


Let's start by importing our libraries:

In [10]:
#Your code goes here

Now, let's practice filling ***NaN*** values using **fillna()**:

In [11]:
#Your code goes here
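
As one possible illustration of the filling strategies described above, here is a minimal sketch. The DataFrame and its column names are hypothetical sample data, not the notebook's dataset. It uses **ffill()** for forward filling, which recent pandas versions favor over passing a `method` argument to **fillna()**.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (not the notebook's dataset)
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0],
    "B": [10.0, 20.0, np.nan, 40.0],
})

# Strategy 1: replace every NaN with a fixed value
filled_zero = df.fillna(0)

# Strategy 2: replace each column's NaN with that column's mean
filled_mean = df.fillna(df.mean())

# Strategy 3: forward fill, carrying the last valid value downward
filled_ffill = df.ffill()

print(filled_mean)
```

Which strategy is appropriate depends on the data: a constant is simple but can distort statistics, while the mean preserves the column average and forward fill suits ordered data such as time series.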

**[Question 2: Complete the 'handle_missing_values' function to handle missing values in the dataset]**

### Handling Outliers

Outliers are extreme values that can skew analysis and modeling results. Pandas can help us identify and handle them. In this example, we identify outliers using the **interquartile range (IQR)** method and remove them.

The **Interquartile Range (IQR)** is a measure of statistical dispersion, representing the range within which the middle 50% of the data lies. It is calculated as the difference between the 75th percentile (also called the third quartile, or Q3) and the 25th percentile (the first quartile, or Q1) of a dataset.

The IQR is useful for describing the spread of data, and it is commonly used to detect outliers. A widely used convention treats values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR as outliers.

Here's a breakdown of what the IQR represents and how to compute it in pandas:

**What IQR Represents**
-  Q1 (25th percentile): The value below which 25% of the data falls.
-  Q3 (75th percentile): The value below which 75% of the data falls.
-  IQR: The range between Q1 and Q3, calculated as IQR = Q3 - Q1.

**How to Calculate IQR in pandas**

To calculate the IQR for a specific column in a pandas DataFrame, use the **quantile()** method to get the 25th and 75th percentiles, then subtract the former from the latter.

In [12]:
#Your code goes here

**Code Explanation**
-  **q1** and **q3** are calculated using the **quantile()** function, representing the first and third quartiles of column ‘B’.
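-  **q1** represents the value below which 25% of the data lies. For column ‘B’, the median of the first half of the sorted values is 15. Note that **quantile()** interpolates linearly between neighboring data points by default, so its result can differ slightly from this "median of the lower half" rule.
-  **q3** represents the value below which 75% of the data lies. For column ‘B’, the median of the second half of the sorted values is 50. The same note about interpolation applies.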
-  **iqr** (Interquartile Range) is computed as the difference between **q3** and **q1**.
-  **lower_bound** and **upper_bound** define the thresholds beyond which data points are considered outliers. They lie 1.5 times the IQR below q1 and above q3: lower_bound = q1 - 1.5 * iqr and upper_bound = q3 + 1.5 * iqr.
-  The line **df_no_outliers = df[(df['B'] >= lower_bound) & (df['B'] <= upper_bound)]** filters the DataFrame to keep only the rows where the values in column ‘B’ fall within the acceptable range, effectively removing the outliers.
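
The steps above can be sketched as follows. The data here is a small hypothetical example, not the notebook's dataset, so its quartiles differ from the values quoted in the explanation; the variable names follow the explanation.

```python
import pandas as pd

# Hypothetical sample data; 500 is an obvious outlier in column 'B'
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [10, 20, 30, 40, 500],
})

# First and third quartiles (pandas interpolates linearly by default)
q1 = df["B"].quantile(0.25)
q3 = df["B"].quantile(0.75)

# Interquartile range and the conventional 1.5 * IQR fences
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only rows whose 'B' value lies inside the fences
df_no_outliers = df[(df["B"] >= lower_bound) & (df["B"] <= upper_bound)]

print(df_no_outliers)
```

The factor 1.5 is a convention rather than a rule. A larger factor keeps more extreme values, and a smaller one removes more.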


**[Question 3: Complete the 'handle_outliers' function to handle outliers in the dataset]**

### Dealing with Duplicate Data
Duplicate data can lead to misleading analysis. Pandas provides **duplicated()** to detect duplicate rows and **drop_duplicates()** to remove them. Here’s how we can do it:


In [13]:
#Your code goes here
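
A minimal sketch of duplicate detection and removal, assuming a small hypothetical DataFrame (the column names `Site` and `Reading` are illustrative only):

```python
import pandas as pd

# Hypothetical sample data; row 2 repeats row 0 exactly
df = pd.DataFrame({
    "Site": ["A", "B", "A", "C", "B"],
    "Reading": [7.1, 6.8, 7.1, 7.4, 6.9],
})

# Boolean mask: True for rows that repeat an earlier row in every column
dup_mask = df.duplicated()

# Drop fully duplicated rows, keeping the first occurrence
df_unique = df.drop_duplicates()

# Treat rows as duplicates when only the 'Site' column matches
df_unique_site = df.drop_duplicates(subset=["Site"], keep="first")

print(dup_mask.sum(), "duplicate row(s) found")
print(df_unique)
```

By default both functions compare all columns. The `subset` argument narrows the comparison, which is useful when an identifier column alone should be unique.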

**[Question 4: Complete the 'handle_duplicates' function to handle duplicates in the dataset]**

### Data Reshaping
Reshaping data is the process of transforming data from one format to another. In the context of data analysis and machine learning (ML), reshaping data often involves reorganizing it into a different structure that is better suited for analysis, visualization, or modeling. Reshaping can involve tasks such as pivoting, melting, stacking, unstacking, and more.

#### Wide to Long Format (Melting)
In this transformation, we convert a dataset from a wide format (many columns) to a long format (fewer columns) by melting or unpivoting it. This is useful when we have variables stored as columns and we want to gather them into a single column.

Melting data is useful for making it more suitable for analysis, especially when we want to compare or aggregate across different variables.

In [14]:
#Your code goes here

**Code Explanation**
-  The **pd.melt()** function is used to transform the *df* DataFrame from wide format to long format.
-  **id_vars=['Depth (m)']** keeps the ‘Depth (m)’ column as an identifier. This means that each temperature and salinity reading will be linked to its depth.

- **value_vars=['Temperature (°C)', 'Salinity (ppt)']** tells the function which columns to transform. In this case, the values of the ‘Temperature (°C)’ and ‘Salinity (ppt)’ columns are gathered into a single value column, with the original column name recorded next to each value. This lets us look at temperature and salinity together.

- **var_name='Measurement'** sets the name for the new column that will show what type of measurement each value represents (either temperature or salinity).

- **value_name='Value'** sets the name for the new column that will show the actual measurement values for temperature and salinity.
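
The call described above might look like the following sketch. The column names and arguments come from the explanation; the measurement values are hypothetical.

```python
import pandas as pd

# Hypothetical wide-format readings: one row per depth
df = pd.DataFrame({
    "Depth (m)": [10, 20, 30],
    "Temperature (°C)": [15.2, 13.8, 12.1],
    "Salinity (ppt)": [35.0, 35.2, 35.4],
})

# Melt to long format: one row per (depth, measurement) pair
df_long = pd.melt(
    df,
    id_vars=["Depth (m)"],
    value_vars=["Temperature (°C)", "Salinity (ppt)"],
    var_name="Measurement",
    value_name="Value",
)

print(df_long)
```

The result has three columns (`Depth (m)`, `Measurement`, `Value`) and one row for each depth and measurement combination, so three depths and two measurements yield six rows.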

#### Long to Wide Format (Pivoting)
This transformation involves converting a long-format dataset back into a wide format by pivoting or spreading the values.

Pivoting is useful when we want to reshape data to make it easier to visualize or perform calculations on.


In [15]:
#Your code goes here

**Code Explanation** 
-  The **pivot()** method is called on the df_long DataFrame to transform it from a long format to a wide format.
-  **index='Depth (m)'** specifies that the new DataFrame should use the 'Depth (m)' column as the row labels. This means each unique depth will become a row in the new DataFrame.
- **columns='Measurement'** tells the function to use the values in the 'Measurement' column (Temperature (°C) and Salinity (ppt)) as the column headers in the new DataFrame.
- **values='Value'** indicates that the actual data in the new DataFrame will come from the 'Value' column. So, it will fill the cells with the corresponding temperature and salinity values at each depth.
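
A minimal sketch of this pivot, assuming a hypothetical `df_long` with the column names used in the explanation. The `reset_index()` step at the end is an optional addition for turning the depth index back into an ordinary column.

```python
import pandas as pd

# Hypothetical long-format data: one row per (depth, measurement) pair
df_long = pd.DataFrame({
    "Depth (m)": [10, 20, 30, 10, 20, 30],
    "Measurement": ["Temperature (°C)"] * 3 + ["Salinity (ppt)"] * 3,
    "Value": [15.2, 13.8, 12.1, 35.0, 35.2, 35.4],
})

# Pivot to wide format: one row per depth, one column per measurement
df_wide = df_long.pivot(index="Depth (m)", columns="Measurement", values="Value")

# Optional: make 'Depth (m)' a regular column and drop the columns' name
df_flat = df_wide.reset_index()
df_flat.columns.name = None

print(df_flat)
```

Note that **pivot()** raises an error if any combination of index and column label occurs more than once. In that case **pivot_table()**, which aggregates duplicates, is the usual alternative.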

#### Stacking and Unstacking
Stacking involves converting columns into rows, and unstacking is the reverse process. These operations can be useful for creating hierarchical indexes and dealing with multi-level data.

Stacking and unstacking can make data manipulation and analysis easier when dealing with multi-indexed data.

In [16]:
#Your code goes here

Stacking transforms a DataFrame from a wide format to a long format by moving a level of the column index to the row index. This operation is useful when you want to collapse hierarchical column indices into a simpler format.

Unstacking is the reverse of stacking. It transforms a DataFrame from a long format to a wide format by moving a level of the row index to the column index.
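
As a rough illustration of both operations, the sketch below stacks a small hypothetical DataFrame (indexed by depth) and then unstacks it again:

```python
import pandas as pd

# Hypothetical wide-format data indexed by depth
df = pd.DataFrame(
    {"Temperature (°C)": [15.2, 13.8], "Salinity (ppt)": [35.0, 35.2]},
    index=pd.Index([10, 20], name="Depth (m)"),
)

# stack(): column labels move into an inner level of the row index,
# producing a Series with a two-level MultiIndex (depth, measurement)
stacked = df.stack()

# unstack(): the inner row-index level moves back to the columns
unstacked = stacked.unstack()

print(stacked)
print(unstacked)
```

Individual values in the stacked result are addressed with a tuple, for example `stacked.loc[(10, "Salinity (ppt)")]`.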


### Handling Inconsistent Data and Standardizing
Handling inconsistent data is a crucial step in data preprocessing to ensure the accuracy and reliability of our analysis or modeling. Inconsistent data refers to values that do not adhere to the expected format or constraints. This can include typos, varying representations, or unexpected values in categorical variables.

Suppose we have a dataset with a “Water_Type” column that contains variations of the categories “Freshwater”, “Salty”, and “Brackish”. To handle inconsistencies, we can standardize the values.

In [17]:
#Your code goes here

**Code Explanation**
- **df['Water_Type']:** This gets the 'Water_Type' column from the DataFrame so we can work with it.

- **.str.lower()**: This changes all the text in the 'Water_Type' column to lowercase. This way, we make sure everything is written the same way.

- **.str.strip()**: This removes any extra spaces at the start or end of the text in the 'Water_Type' column. This helps avoid mistakes caused by those spaces.

- **.replace({'freshwater': 'Freshwater', 'salty': 'Salty', 'brackish': 'Brackish'})**: This replaces the words in the 'Water_Type' column. It changes 'freshwater' to 'Freshwater', 'salty' to 'Salty', and 'brackish' to 'Brackish'. This makes sure we have a consistent way of writing these terms.

- **Updating the Column**: Finally, we save the cleaned-up values back into the 'Water_Type' column. This makes the data neat and ready for analysis.
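
The chain described above can be sketched as follows, assuming a hypothetical `Water_Type` column with mixed case and stray spaces:

```python
import pandas as pd

# Hypothetical inconsistent entries (mixed case, leading/trailing spaces)
df = pd.DataFrame({
    "Water_Type": ["Freshwater", " salty", "BRACKISH ", "freshwater", "Salty "],
})

df["Water_Type"] = (
    df["Water_Type"]
    .str.lower()   # unify case
    .str.strip()   # remove leading and trailing whitespace
    .replace({     # map cleaned values to the canonical spelling
        "freshwater": "Freshwater",
        "salty": "Salty",
        "brackish": "Brackish",
    })
)

print(df["Water_Type"].unique())
```

With a dictionary, **replace()** swaps only values that match a key exactly, so anything not listed is left unchanged.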


Now, suppose we have a dataset with a “Water_Type” column that contains various names for types of water, including some inconsistent spellings and synonyms. We want to standardize these water type names.

In [18]:
#Your code goes here

**Code Explanation**
- **Mapping Definition:** A dictionary called water_type_mapping is created to map the inconsistent names to standardized names. For example, both 'freshwater' and 'FRESHWATER' will be converted to 'Freshwater'.

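- **Applying the Mapping:** We convert the ‘Water_Type’ column to lowercase and use .map(water_type_mapping) to replace the values based on our defined mapping. Any value without a matching key in the dictionary becomes NaN, so the mapping needs to cover every variant that appears in the column.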

- **Final Output:** Finally, we print the updated DataFrame, which now contains consistent water type names.
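
A minimal sketch of this approach is shown below. The sample values and the dictionary entries are hypothetical, and the check for unmapped values is an added safeguard rather than part of the steps described above.

```python
import pandas as pd

# Hypothetical entries with varying case, synonyms, and a misspelling
df = pd.DataFrame({
    "Water_Type": ["freshwater", "FRESHWATER", "Fresh Water",
                   "salt water", "Saline", "brackish", "Brakish"],
})

# Keys are lowercase because the column is lowercased before mapping
water_type_mapping = {
    "freshwater": "Freshwater",
    "fresh water": "Freshwater",
    "salt water": "Salty",
    "saline": "Salty",
    "salty": "Salty",
    "brackish": "Brackish",
    "brakish": "Brackish",
}

mapped = df["Water_Type"].str.lower().map(water_type_mapping)

# map() yields NaN for values missing from the dictionary; list them
unmapped = df.loc[mapped.isna(), "Water_Type"].unique()

df["Water_Type"] = mapped
print(df)
print("Unmapped values:", list(unmapped))
```

Inspecting `unmapped` before overwriting the column is a cheap way to spot variants that still need an entry in the dictionary.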

**[Question 5: Complete the 'standardize_data' function to standardize the 'Species' column in the dataset]**