# Visualizing Missing Data and Detecting Outliers 
## Introduction
This Jupyter Notebook visualizes missing data in the crime dataset for the City of Los Angeles, starting from 2020. We also attempt to visualize outliers in the dataset and eliminate them. The dataset contains comprehensive records of crime incidents transcribed from original crime reports, some of which were typed on paper, so it may contain inaccuracies. Additionally, certain location fields with missing data are denoted as (0°, 0°). For privacy reasons, address fields are given only to the nearest hundred block. While the data is generally reliable, any questions or concerns can be addressed through comments.

**Dataset Description**
The dataset consists of various columns representing different attributes of crime incidents in Los Angeles. We will read the CSV file and perform an analysis to visualize the presence of missing data using Plotly.

**About the Dataset**:
1. Dataset Name: Crime_Data_from_2020_to_Present - https://www.kaggle.com/datasets/venkatsairo4899/los-angeles-crime-data-2020-2023.
2. Source: Original crime reports
3. Time Period: Starting from 2020

# Step 1 
**Importing Libraries**

We begin by importing the necessary Python libraries that will be used throughout the script. Each library serves a specific purpose in data handling, visualization, and outlier detection.

1. NumPy (imported as np): NumPy is a powerful library for numerical computations in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of high-level mathematical functions to operate on these arrays efficiently.

2. Pandas (imported as pd): Pandas is a widely-used library for data manipulation and analysis in Python. It introduces the DataFrame and Series data structures, which are powerful tools to work with structured data, including CSV files, spreadsheets, and databases.

3. Plotly.graph_objects (imported as go): Plotly is a versatile library for creating interactive and visually appealing data visualizations. We specifically import the graph_objects module, which allows us to create various types of plots and graphs, such as heatmaps and boxplots.

4. Plotly.express (imported as px): Plotly Express is a high-level interface for creating a variety of interactive visualizations quickly and easily. It simplifies the process of creating complex plots by providing a concise syntax and intuitive API.

5. Scikit-learn IsolationForest: Scikit-learn (sklearn) is a popular machine learning library in Python. The IsolationForest class is an implementation of the Isolation Forest algorithm, which is used for outlier detection. It isolates instances in a dataset by constructing decision trees and identifying anomalies as instances that require fewer splits to be isolated.

6. Scikit-learn train_test_split: This function from scikit-learn is used for splitting data into training and testing sets. It is commonly used in machine learning workflows to divide the data into two subsets: one for training the model and the other for evaluating its performance.

7. Pyculiarity detect_ts: Pyculiarity is a library for anomaly detection in time series data. The detect_ts function from pyculiarity provides a method for detecting anomalies in time series using the Twitter Anomaly Detection Algorithm (AD) and Seasonal Hybrid ESD (Extreme Studentized Deviate) Test.

8. heapq nlargest: The heapq module is used for heap queue algorithm implementation in Python. The nlargest function from heapq returns the n largest elements from a given iterable.

9. Statsmodels.tsa.seasonal seasonal_decompose: Statsmodels is a library for statistical modeling in Python, including time series analysis. The seasonal_decompose function from statsmodels.tsa.seasonal is used for seasonal decomposition of time series data to analyze its underlying components, including trend, seasonality, and residuals.

By importing these libraries at the beginning of the script, we ensure that we have access to the necessary tools and functionalities for handling data, creating visualizations, and performing outlier detection using the Isolation Forest algorithm. This sets the foundation for the subsequent steps in the analysis of the crime dataset for the City of Los Angeles.


In [None]:
# Step 1: Importing necessary libraries
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from pyculiarity import detect_ts
from heapq import nlargest
from statsmodels.tsa.seasonal import seasonal_decompose

# Step 2
**Exploring the Data**

Before visualizing missing data and detecting outliers, let's explore the dataset to understand its structure and contents.

We begin by loading the crime dataset from a CSV file into a pandas DataFrame for further analysis. Before running the code, replace the value of the file_path variable with the actual path to your CSV file containing the crime data.

- First, we define the file_path variable and assign it the path to the CSV file. The ``r`` before the string marks it as a raw string, so backslashes in a Windows-style path are not interpreted as escape sequences.

- Next, we use the pandas library to read the CSV file using the ``pd.read_csv()`` function. The function takes the file path as an argument and loads the data from the CSV file into a DataFrame named df. The DataFrame is a two-dimensional tabular data structure that organizes the crime data in rows and columns.

- After loading the data, we display the first few rows of the DataFrame using the ``df.head()`` method. This allows us to inspect the structure of the data and verify that it was loaded correctly. The head() method by default shows the first 5 rows, giving us a glimpse of the data and its columns.

By executing this code, you will have successfully imported the crime dataset into a pandas DataFrame, and you can proceed with data analysis, visualization, and other tasks to gain insights and explore the patterns in the crime data from 2020 to the present in Los Angeles.

In [None]:
# Replace the file path with the actual path to your CSV file
file_path = r"C:\Users\NDU-PC\Desktop\archive\Crime_Data_from_2020_to_Present.csv"

# Read the CSV file into a pandas DataFrame
df = pd.read_csv(file_path)
df.head()


# Step 3

**Visualizing Missing Data**

Now, we will identify missing values in the DataFrame and create a heatmap using Plotly to visualize the missing cells.

- At this point, we are creating a heatmap to visualize the presence of missing data in the crime dataset using Plotly. The goal is a visual representation in which each cell with data is black and each cell without data (missing data) is white.

- First, we create a binary matrix (binary_data) from the original crime dataset (df) using the ``notnull()`` method. The binary matrix is constructed in such a way that it assigns a value of 1 to cells containing data and a value of 0 to cells with missing data (null values).

- Next, we use the Plotly library to create the heatmap. We initialize a new Figure object (fig) and create a Heatmap trace with the binary_data matrix. The z parameter is set to binary_data.values, which contains the binary values representing the presence of data (1) and missing data (0) in the dataset. The x and y parameters are set to df.columns and df.index, respectively, to label the heatmap's x and y axes with the column and row names from the crime dataset.

- To achieve the black-and-white representation, we customize the colorscale of the heatmap using the colorscale parameter. The colorscale is defined as ``[[0, 'white'], [1, 'black']]``, where 0 represents white and 1 represents black. As a result, cells with missing data (0 in binary_data) will be displayed as white, and cells with data (1 in binary_data) will be displayed as black.

- The showscale parameter is set to False to hide the color scale bar, as it is not needed in this visualization.

- We further customize the layout of the heatmap by setting the title of the figure to 'Missing Data Visualization' and labeling the x and y axes as 'Columns' and 'Rows', respectively.

- Finally, the Plotly figure is displayed using the ``fig.show()`` function, which visualizes the presence of missing data in the crime dataset as a black-and-white heatmap. This visualization allows us to quickly identify the patterns of missing data and assess the completeness of the dataset in different columns and rows.

In [None]:
# Create a binary matrix where 1 represents data and 0 represents missing data
binary_data = df.notnull().astype(int)

# Create a heatmap using Plotly
fig = go.Figure(data=go.Heatmap(
    z=binary_data.values,
    x=df.columns,
    y=df.index,
    colorscale=[[0, 'white'], [1, 'black']],
    showscale=False,
))

# Customize the layout of the heatmap
fig.update_layout(
    title='Missing Data Visualization',
    xaxis=dict(title='Columns'),
    yaxis=dict(title='Rows'),
)

# Show the Plotly figure
fig.show()
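
As a numeric complement to the heatmap, a per-column count of missing cells can be sketched as follows. This is a minimal sketch on a toy DataFrame, not the notebook's own method: the coordinate column names `LAT` and `LON` are assumptions, and the (0, 0) rule follows the dataset note in the introduction.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the crime DataFrame; the column names are assumptions.
toy = pd.DataFrame({
    "Vict Age": [25, 40, np.nan, 31],
    "LAT": [34.05, 0.0, 34.10, 0.0],
    "LON": [-118.24, 0.0, -118.30, 0.0],
})

# Treat (0, 0) coordinates as missing, as the dataset notes describe.
zero_coords = (toy["LAT"] == 0) & (toy["LON"] == 0)
toy.loc[zero_coords, ["LAT", "LON"]] = np.nan

# Count and percentage of missing cells per column.
missing_summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "percent": toy.isna().mean() * 100,
})
print(missing_summary)
```

Converting the (0, 0) placeholders to `NaN` first means they would also show up as white cells if the same frame were passed to the heatmap code.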

## Methods for Detecting Outliers: An Overview

Outliers, also known as anomalies or extreme values, are data points that significantly deviate from the rest of the data in a dataset. Detecting outliers is a crucial step in data analysis and modeling, as these unusual data points can have a significant impact on the results, leading to biased conclusions or erroneous predictions. Proper outlier detection helps improve the accuracy and reliability of data analysis, anomaly detection, and machine learning models. In this write-up, we will explore various methods for detecting outliers and discuss their strengths and appropriate use cases.

1. Z-Score Method: The Z-score method is a statistical technique for detecting outliers based on standard deviations from the mean. It calculates the Z-score of each data point, representing how many standard deviations the data point is away from the mean. Data points with Z-scores exceeding a specified threshold (commonly set at 2 or 3 standard deviations) are considered outliers.

**Strengths**:
- Simple and easy to implement.
- Suitable for data with a normal distribution.

**Use Cases**:
- Data with approximately normal distribution.
- When the mean and standard deviation accurately represent the data.
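
The Z-score rule described above can be sketched in a few lines of NumPy. The function name `zscore_outliers`, the default threshold of 3, and the sample ages are all hypothetical choices made for illustration.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return a boolean mask that is True where |z| exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        # All values are identical, so nothing can be an outlier.
        return np.zeros(values.shape, dtype=bool)
    z = (values - values.mean()) / std
    return np.abs(z) > threshold

# Hypothetical ages: twenty identical values and one extreme value.
ages = np.array([30.0] * 20 + [120.0])
flags = zscore_outliers(ages)
print(ages[flags])
```

Because the mean and standard deviation are themselves pulled by extreme values, a single large outlier in a small sample can mask itself, which is the motivation for the modified variant.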

2. Modified Z-Score Method: The Modified Z-score method is a variation of the Z-score method that is more robust to outliers. It uses the median as the measure of center and the Median Absolute Deviation (MAD) as the measure of dispersion, instead of the mean and standard deviation. Data points whose modified Z-scores exceed a threshold in absolute value (3.5 is a commonly recommended cutoff) are flagged as outliers.

**Strengths**:
- Robust to outliers and resistant to extreme values.
- Suitable for data with non-normal distributions.

**Use Cases**:
- Data with skewed or heavy-tailed distributions.
- When outliers might have a significant impact on the mean and standard deviation.
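
A minimal sketch of the modified Z-score, assuming the usual scaling constant 0.6745 (which makes the MAD comparable to a standard deviation for normal data). The function name, the default cutoff of 3.5, and the sample values are assumptions for illustration; any other cutoff can be passed in.

```python
import numpy as np

def modified_zscore_outliers(values, threshold=3.5):
    """Flag points whose MAD-based modified Z-score exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))  # Median Absolute Deviation
    if mad == 0:
        # More than half the values equal the median; the score is undefined.
        return np.zeros(values.shape, dtype=bool)
    modified_z = 0.6745 * (values - median) / mad
    return np.abs(modified_z) > threshold

# Hypothetical ages with one implausible value.
ages = np.array([22, 25, 27, 29, 31, 33, 35, 38, 120], dtype=float)
flags = modified_zscore_outliers(ages)
print(ages[flags])
```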

3. Isolation Forest: The Isolation Forest algorithm is a machine learning-based method for outlier detection. It isolates outliers by randomly partitioning the data into trees and identifying instances that require fewer splits to be isolated. Outliers are isolated more quickly, leading to shorter paths in the trees.

**Strengths**:
- Efficient for high-dimensional data.
- Effective for datasets with a small proportion of outliers.

**Use Cases**:
- High-dimensional data, such as sensor readings or numerical data.
- Datasets with a small proportion of outliers compared to inliers.
- Numerical data analysis.

4. Local Outlier Factor (LOF): LOF is a density-based method for detecting outliers by measuring the local deviation of a data point from its neighbors. It compares the density of a data point with that of its neighbors, and data points with low density relative to their neighbors are considered outliers.

**Strengths**:
- Able to handle irregularly shaped clusters and varying densities.
- Suitable for datasets with complex structures.

**Use Cases**:
- Clustering and anomaly detection in spatial data.
- Data with varying densities or non-uniform clusters.
- Numerical data analysis.
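
LOF is not used later in this notebook, so the following is only a sketch of how it could be applied with scikit-learn's `LocalOutlierFactor`. The synthetic points and the choice of `n_neighbors=10` are assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Hypothetical 2-D points: one tight cluster plus a single distant point.
cluster = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
points = np.vstack([cluster, [[8.0, 8.0]]])

lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(points)          # 1 = inlier, -1 = outlier
scores = -lof.negative_outlier_factor_    # larger = more anomalous

print(labels[-1], scores[-1])
```

The `negative_outlier_factor_` attribute stores the negated LOF score, so flipping its sign gives a value where scores near 1 indicate typical density and larger scores indicate isolation.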

5. Seasonal Hybrid ESD Test: The Seasonal Hybrid ESD Test is specifically designed for detecting outliers in time series data with seasonal patterns. It combines statistical decomposition and hypothesis testing to identify anomalies in seasonal time series.

**Strengths**:
- Effective for time series data with seasonal fluctuations.
- Can capture anomalies that deviate from expected seasonal behavior.

**Use Cases**:
- Time series data with clear seasonal patterns, such as weather data or financial time series.
- Anomaly detection in periodic data.

6. Outlier Detection in Categorical Data based on Frequency Distributions: This method detects outliers in categorical data by analyzing the frequency distribution of categories. It identifies categories that deviate significantly from the majority and might require further investigation or handling.

**Strengths**:
- Suitable for categorical variables and frequency analysis.
- Can identify rare or unusual categories.

**Use Cases**:
- Analyzing customer segments or product categories with low occurrence.
- Identifying infrequent events or occurrences in categorical data.
- Preliminary analysis to gain insights into the distribution of categories.

**Conclusion**:
Selecting the appropriate method for detecting outliers depends on the nature of the data, the distribution of the variables, and the specific problem at hand. It is essential to consider the strengths and limitations of each technique and tailor the outlier detection approach to the characteristics of the dataset. A combination of multiple methods may be employed for a comprehensive outlier analysis. Proper outlier detection contributes to more accurate data analysis, reliable modeling, and better decision-making, ultimately leading to more robust and meaningful insights from the data.

# Step 4

**Detecting Outliers using Isolation Forest**

Next, we will apply the Isolation Forest algorithm to detect outliers in the dataset.

Here we are preparing the dataset for outlier detection using the Isolation Forest algorithm. We start by selecting the "AREA" and "Vict Age" columns from the original crime dataset. These columns are chosen because we want to focus on detecting potential outliers in the crime incidents' geographic locations (AREA) and the ages of the victims (Vict Age), and because the Isolation Forest algorithm works best with numerical data.

After selecting the columns of interest, we set parameters for the Isolation Forest algorithm. The ``CONTAMINATION`` parameter specifies the expected proportion of outliers in the data, and in this case, it is set to 0.2, which means we expect around 20% of the data to be outliers. The ``BOOTSTRAP`` parameter is set to False, indicating that we will not use the bootstrap method during the isolation forest algorithm execution.

Next, we extract the values of the selected DataFrame ``(selected_df)`` into a NumPy array (data) and split it by column. The array X holds every column except the last, which here is only "AREA", and y holds the last column, "Vict Age". Only X is passed to the Isolation Forest in the code below, so the fitted model scores rows by their "AREA" value alone; y is not used afterward.

We then create an instance of the Isolation Forest classifier ``(i_forest)`` with the specified contamination and bootstrap parameters. The Isolation Forest algorithm is a popular unsupervised anomaly detection technique that isolates outliers by randomly partitioning the data and identifying instances that are easier to isolate.

The fit_predict method is used to perform outlier detection on the X array. It returns an array of labels ``(is_inlier)`` in which 1 marks inliers (non-outliers) and -1 marks outliers.

Next, we create a mask (mask) that filters out the outliers by selecting only the rows in the X array that are classified as inliers (1). This process effectively removes the detected outliers from the data, leaving us with a subset of data that contains only the non-outlier instances.

The resulting X_train array contains the data without outliers and is ready for further analysis, visualization, or modeling tasks. By removing outliers from the selected columns, we can gain a better understanding of the central tendency and patterns in the geographic locations and ages of the victims in the crime dataset.

In [None]:
# Select only the "AREA" and "Vict Age" columns for outlier detection
selected_columns = ["AREA", "Vict Age"]
selected_df = df[selected_columns]

# Set parameters for Isolation Forest
CONTAMINATION = 0.2
BOOTSTRAP = False

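# Convert to a NumPy array and split by column:
# X keeps every column except the last ("AREA"); y is the last column ("Vict Age")
data = selected_df.values
X, y = data[:, :-1], data[:, -1]

# Fit the Isolation Forest on X; fit_predict returns 1 for inliers and -1 for outliers
i_forest = IsolationForest(contamination=CONTAMINATION, bootstrap=BOOTSTRAP)
is_inlier = i_forest.fit_predict(X)

# Boolean mask that is True for inlier rows; keep only those rows
mask = is_inlier != -1
X_train = X[mask, :]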
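
In the cell above, X holds only the "AREA" column because the last column is split off as y. If the goal is to score rows on both columns together, one option is to pass both to the model. The following is a minimal sketch on synthetic data, not the notebook's own run: the toy values, the contamination of 0.05, and the `random_state` (added for reproducibility) are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic stand-in for the two selected columns (hypothetical values).
toy = pd.DataFrame({
    "AREA": rng.integers(1, 22, size=200),
    "Vict Age": rng.integers(18, 70, size=200),
})
toy.loc[0, "Vict Age"] = 120  # plant one implausible age

# Fit on both columns at once instead of splitting off the last one.
features = toy[["AREA", "Vict Age"]].to_numpy()
forest = IsolationForest(contamination=0.05, random_state=0)
labels = forest.fit_predict(features)  # 1 = inlier, -1 = outlier

toy_inliers = toy[labels == 1]
print(len(toy), len(toy_inliers), labels[0])
```

Note that `contamination` fixes the share of rows labeled as outliers, so a value such as 0.2 will flag roughly 20% of rows whether or not they are genuinely unusual.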

# Step 5

**Visualizing the Outlier Detection Results (Boxplots)** - *Isolation Forest algorithm* 

We will compare selected columns with and without outliers using boxplots.

In this portion of the code, we visualize the results of the Isolation Forest outlier detection on the selected columns "AREA" and "Vict Age" from the crime dataset. The previous step produced an array of labels in which 1 represents inliers (non-outliers) and -1 represents outliers, and from it a boolean mask (mask) that is True for inliers.

Using the boolean mask, we create a new DataFrame (df_wo_outliers) containing only the data points that are considered inliers (without outliers) for the selected columns. This DataFrame will be used to compare the distribution of the selected columns with and without outliers.

Next, we create boxplots to visualize the distribution of the "AREA" and "Vict Age" columns both with and without outliers. The boxplot labeled "With outliers" represents the original distribution of the selected columns from the dataset, including the outliers. The boxplot labeled "Without outliers" represents the distribution of the selected columns after removing the detected outliers.

The boxplots provide a visual comparison of the central tendency, spread, and presence of outliers for each column before and after outlier removal. Additionally, we set titles for each boxplot to indicate that they display the "Outlier Detection Results" for the "AREA" and "Vict Age" columns, respectively.

This analysis helps us understand the impact of outliers on the distribution of data in these columns and allows us to make informed decisions on how to handle outliers during further data analysis and modeling tasks.

In [None]:
# Create a DataFrame without outliers
df_wo_outliers = pd.DataFrame(data=selected_df[mask], columns=selected_df.columns)

# Compare 'AREA' column with and without outliers using boxplots
fig_area_boxplot = go.Figure()
fig_area_boxplot.add_trace(go.Box(y=df[selected_columns[0]], name='With outliers'))
fig_area_boxplot.add_trace(go.Box(y=df_wo_outliers[selected_columns[0]], name='Without outliers'))
fig_area_boxplot.update_layout(title='Outlier Detection Results for "AREA" Column')
fig_area_boxplot.show()

# Compare 'Vict Age' column with and without outliers using boxplots
fig_vict_age_boxplot = go.Figure()
fig_vict_age_boxplot.add_trace(go.Box(y=df[selected_columns[1]], name='With outliers'))
fig_vict_age_boxplot.add_trace(go.Box(y=df_wo_outliers[selected_columns[1]], name='Without outliers'))
fig_vict_age_boxplot.update_layout(title='Outlier Detection Results for "Vict Age" Column')
fig_vict_age_boxplot.show()

# Step 6 

**Detecting Outliers using Seasonal Hybrid ESD Test (Extreme Studentized Deviate)**

The outlier detection used below is modeled on the Seasonal Hybrid ESD (Extreme Studentized Deviate) Test, a statistical method that combines seasonal decomposition with a hypothesis test to detect anomalies in time series data. The code implements a simplified version: it decomposes the series and then applies a fixed standard-deviation threshold to the residuals instead of running the full ESD test.

Here is a breakdown of the outlier detection process in the code:

- Seasonal Decomposition: The code first performs seasonal decomposition on the time series data using the `seasonal_decompose` function from the `statsmodels.tsa.seasonal` module. Seasonal decomposition breaks down the time series into its underlying components: trend, seasonality, and residuals (remainder).

- Seasonally Adjusted Values: After decomposing the time series, the code takes the residual (remainder) component, which is what is left after the trend and seasonal components are removed from the original data, and stores it in a column named `seasonally_adjusted`. This leaves only the irregular fluctuations in the data, which are expected to be close to zero for a well-behaved time series.

- Mean and Standard Deviation: The code calculates the mean and standard deviation of the seasonally adjusted values. These statistics are used to determine the threshold for identifying outliers.

- Threshold for Outliers: A threshold is set based on the standard deviation. In this case, the threshold is two times the standard deviation. Data points that deviate from the mean by more than this threshold are considered potential outliers.

- Identify Outliers: The code identifies the potential outliers by comparing the seasonally adjusted values to the mean and threshold. Any data point whose value deviates from the mean by more than the threshold is treated as an outlier.

- Create DataFrame Without Outliers: A new DataFrame `df_wo_outliers` is created as a copy of `data_for_detection`. I did this to bypass some date parsing errors I had encountered. The identified outlier values are then replaced with `NaN`, and the `NaN` values are filled using forward fill (`ffill`) to keep the time series continuous, without gaps.

Overall, seasonal decomposition followed by a test on the residuals is a robust approach to outlier detection in time series data, especially when the data exhibits seasonal patterns or periodic fluctuations. It can identify anomalous data points that deviate from the expected seasonal behavior, making it useful for applications such as anomaly detection in sensor data, financial data, and environmental data.

In [None]:
# Work on a copy so the original df stays intact for the later steps
new_df = df.copy()

# Parse "Date Rptd" as datetime and drop any time-of-day component
new_df["Date Rptd"] = pd.to_datetime(new_df["Date Rptd"]).dt.normalize()

# Keep only the first row for each reporting date
new_df = new_df.drop_duplicates(subset=["Date Rptd"], keep="first")

# Sort chronologically (sorting DD-MM-YYYY strings would order by day of month instead)
new_df = new_df.sort_values(by="Date Rptd")

# Create a new DataFrame with the required format (timestamps in the first column and numeric values as the second column)
# The value is the row index of the first report on each date
data_for_detection = pd.DataFrame(data={"timestamp": new_df["Date Rptd"], "value": new_df.index})

# Perform seasonal decomposition using statsmodels' seasonal_decompose function
decomposition = seasonal_decompose(data_for_detection["value"], period=7)  # Assuming weekly seasonality (period=7)

# Get the seasonally adjusted values (remainder component)
data_for_detection["seasonally_adjusted"] = decomposition.resid

# Calculate the mean and standard deviation of the seasonally adjusted values
mean_seasonally_adjusted = data_for_detection["seasonally_adjusted"].mean()
std_seasonally_adjusted = data_for_detection["seasonally_adjusted"].std()

# Set the threshold for outliers (e.g., 2 standard deviations from the mean)
threshold = 2 * std_seasonally_adjusted

# Identify the outliers
outliers = data_for_detection[np.abs(data_for_detection["seasonally_adjusted"] - mean_seasonally_adjusted) > threshold]

# Get the outlier timestamps and values
outlier_timestamps = outliers["timestamp"]
outlier_values = outliers["seasonally_adjusted"]

# Create a DataFrame without outliers
df_wo_outliers = data_for_detection.copy()
df_wo_outliers["seasonally_adjusted"] = np.where(df_wo_outliers["timestamp"].isin(outlier_timestamps), np.nan, df_wo_outliers["seasonally_adjusted"])
df_wo_outliers.ffill(inplace=True)

# Compare the seasonally adjusted values with and without outliers using boxplots
# (the timestamps themselves are identical in both DataFrames, so plotting them shows no difference)
fig_date_rptd_boxplot = go.Figure()
fig_date_rptd_boxplot.add_trace(go.Box(y=data_for_detection["seasonally_adjusted"], name='With outliers'))
fig_date_rptd_boxplot.add_trace(go.Box(y=df_wo_outliers["seasonally_adjusted"], name='Without outliers'))
fig_date_rptd_boxplot.update_layout(title='Outlier Detection Results for "Date Rptd" Column')
fig_date_rptd_boxplot.show()


# Step 7 

**Outlier Detection in Categorical Data based on Frequency Distribution**

The type of outlier detection used in this code is "Outlier Detection in Categorical Data based on Frequency Distribution." It identifies outlier categories by comparing the frequency of each category to a threshold. Categories whose frequency exceeds the average frequency plus the threshold are considered outliers, so only unusually frequent categories are flagged here. The goal is to find and analyze the categories that deviate substantially from the majority and might require further investigation or handling in the data analysis process.

To do this, I use the following steps to perform outlier detection based on the frequency distribution of categorical data in the "AREA NAME" column. 

- Calculate the frequency of each category: The code calculates the frequency of each unique category in the `"AREA NAME"` column using the `value_counts()` function and stores the result in the variable area_counts.

- Determine the overall frequency distribution (optional, for visualization): The code creates a bar plot using Plotly Express to visualize the frequency distribution of each area. The x-axis represents the area names, and the y-axis represents the frequency of occurrences. This step is optional and helps understand the distribution of data before performing outlier detection.

- Set a threshold: The code sets a threshold for outlier detection. The threshold is calculated as three times the standard deviation of the frequency counts in the area_counts variable. You can adjust this threshold based on your data and domain knowledge.

- Identify outlier categories: The code identifies outlier categories by comparing the frequency counts with the mean plus the threshold value. Any category with a frequency count greater than the mean plus the threshold is considered an outlier. The outlier categories are stored in the `outlier_areas` variable.

- Flag or remove outlier categories from the DataFrame: The code creates a copy of the original DataFrame `df`, named `df_with_outliers`, and adds a new column called `"Is_Outlier"`. It marks each row with True if the corresponding `"AREA NAME"` is an outlier and False otherwise.

- A new DataFrame `df_without_outliers` is created by filtering out the rows with outlier categories from `df_with_outliers`. This DataFrame contains only the non-outlier rows.

Finally, the code uses Plotly Express to create two box plots, shown one after the other: one representing the distribution of the "AREA NAME" column with all areas (including outliers), and the other representing the distribution of the "AREA NAME" column with outlier areas removed.

In [None]:
# Step 1: Calculate the frequency of each category
area_counts = df["AREA NAME"].value_counts()

# Step 2: Determine the overall frequency distribution (optional, for visualization)
fig_area_counts = px.bar(area_counts, x=area_counts.index, y=area_counts.values, labels={"x": "Area Name", "y": "Frequency"})
fig_area_counts.update_layout(title="Frequency Distribution of Area Names")
fig_area_counts.show()

# Step 3: Set a threshold (you can adjust this based on your data)
threshold = 3 * area_counts.std()

# Step 4: Identify outlier categories
outlier_areas = area_counts[area_counts > (area_counts.mean() + threshold)].index

# Print or do further analysis on outlier areas
print("Outlier Area Names:")
print(outlier_areas)

# Create a copy of the DataFrame with outliers
df_with_outliers = df.copy()

# Step 5: Flag or remove outlier categories from the DataFrame
df_with_outliers["Is_Outlier"] = df_with_outliers["AREA NAME"].isin(outlier_areas)

# Create a DataFrame without outliers
df_without_outliers = df_with_outliers.loc[~df_with_outliers["Is_Outlier"]]

# Create boxplots to compare 'AREA NAME' column with and without outliers
fig_boxplot = px.box(df, x="AREA NAME", title="Box Plot - All Areas")
fig_boxplot_wo_outliers = px.box(df_without_outliers, x="AREA NAME", title="Box Plot - Areas Without Outliers")

# Show both boxplots, one after the other
fig_boxplot.show()
fig_boxplot_wo_outliers.show()
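
The cell above flags only unusually frequent areas. Rare categories can be flagged in a similar way by comparing each category's share of rows with a minimum cutoff. This is a minimal sketch on made-up labels; the 5% cutoff and the variable names are hypothetical and should be tuned to the data.

```python
import pandas as pd

# Hypothetical area labels; "Area D" appears far less often than the rest.
areas = pd.Series(
    ["Area A"] * 40 + ["Area B"] * 35 + ["Area C"] * 23 + ["Area D"] * 2
)

# Relative frequency of each category.
shares = areas.value_counts(normalize=True)

# Hypothetical cutoff: categories covering less than 5% of rows count as rare.
min_share = 0.05
rare_categories = shares[shares < min_share].index.tolist()

# Row-level flag, analogous to the "Is_Outlier" column above.
is_rare = areas.isin(rare_categories)
print(rare_categories, int(is_rare.sum()))
```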


# Conclusion

This Jupyter Notebook focused on visualizing missing data and detecting outliers in the crime dataset for the City of Los Angeles from 2020 onwards.

We used various Python libraries, including Pandas, Plotly, scikit-learn, and statsmodels, to handle data, create visualizations, and perform outlier detection.

Here's a summary of the key steps and findings:

- Visualized missing data using a heatmap to identify data gaps in different columns and rows.

- Detected outliers in numerical columns "AREA" and "Vict Age" using the Isolation Forest algorithm.

- Detected outliers in the time series data for the "Date Rptd" column using seasonal decomposition with a standard-deviation threshold, a simplified take on the Seasonal Hybrid ESD Test.

- Detected outlier categories based on the frequency distribution of the categorical data in the "AREA NAME" column.

The analysis provides valuable insights into data quality and potential anomalies in the crime dataset. This information can be used to guide data preprocessing, cleansing, and modeling processes in crime data analysis and forecasting. The methods demonstrated here can be applied to similar datasets for visualizing missing data and detecting outliers in various applications.

**N.B.**

*Please note that it is important to be careful when detecting and eliminating outliers. For example, given that this is a crime dataset, eliminating outliers can result in the loss of valuable insights into crimes committed in the region. This notebook serves illustrative purposes only, and you should exercise judgment when using these techniques in real life.*
