<a href="https://www.kaggle.com/code/faiqueali/plotting-visualization?scriptVersionId=142882595" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<p><center><h1>Plotting and Visualization</h1></center></p>

## Plotting Significance
1. <strong>Enhances Data Understanding:</strong> Visualization helps in grasping complex data patterns, trends, and relationships more effectively than raw numbers or tables.

2. <strong>Identifies Outliers:</strong> Visualizations make it easier to spot outliers or anomalies in the data, which are crucial for data quality assessment.

3. <strong>Supports Data Cleaning:</strong> Visualizations aid in identifying missing data, duplicates, and inconsistencies, assisting in the data cleaning process.

4. <strong>Enables Exploratory Data Analysis (EDA):</strong> Plots allow data scientists to perform EDA, uncover hidden insights, and formulate hypotheses for further investigation.

5. <strong>Facilitates Decision-Making:</strong> Data visualizations provide stakeholders with clear, concise, and actionable insights, aiding in informed decision-making.

6. <strong>Communicates Findings:</strong> Visualizations are an effective means to convey results and findings to both technical and non-technical audiences.

7. <strong>Improves Model Selection:</strong> Visualizations help in selecting appropriate machine learning models by assessing data distribution and model performance.

8. <strong>Monitors Trends Over Time:</strong> Time series visualizations track data trends, allowing for real-time monitoring and forecasting.

9. <strong>Supports Feature Engineering:</strong> Visualizations assist in identifying and engineering relevant features for machine learning models.

10. <strong>Enhances Storytelling:</strong> Visual narratives can be created using data visualizations to tell a compelling story about the data, making it more engaging and persuasive.

## Types of Plot
<p><i>For more plot types, follow the links below.</i></p>
<a href="https://matplotlib.org/stable/plot_types/index.html">Matplotlib documentation</a>

<a href="https://seaborn.pydata.org/examples/index.html">Seaborn documentation</a>

## Difference Between Matplotlib and Seaborn
**Matplotlib** and **Seaborn** are both Python libraries for **data visualization**. However, they have different strengths and weaknesses.

Matplotlib is a more **general-purpose library** that can be used to create a variety of charts and graphs. It is also more customizable, giving you more control over the appearance of your plots. However, Matplotlib can be more **difficult to learn** and use, especially for beginners.

Seaborn is a more **specialized library**, built on top of Matplotlib, that is designed for statistical data visualization. It provides a number of pre-built functions for creating common **statistical plots**, such as scatter plots, line plots, and box plots. Seaborn is also **easier to learn** and use than Matplotlib, making it a good choice for beginners.

Here are some additional things to consider when choosing between Matplotlib and Seaborn:

* **Your level of experience**: If you are a beginner, then Seaborn is a good choice because it is easier to learn and use.
* **The type of data you need to visualize**: If you need to visualize statistical data, then Seaborn is a good choice because it provides a number of pre-built functions for creating common statistical plots.
* **The level of customization you need**: If you need a lot of control over the appearance of your plots, then Matplotlib is a good choice.
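
As a rough illustration of the trade-off described above, the sketch below draws the same grouped scatter plot with each library. The DataFrame `df` and its columns (`x`, `y`, `group`) are made up for this example; pandas is assumed to be installed alongside Seaborn.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: two groups that follow slightly different trends
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": np.tile(np.arange(20), 2),
    "group": np.repeat(["A", "B"], 20),
})
slope = np.where(df["group"] == "A", 1.0, 1.5)
df["y"] = df["x"] * slope + rng.normal(0, 2, len(df))

fig, (ax_mpl, ax_sns) = plt.subplots(1, 2, figsize=(10, 4))

# Matplotlib: loop over the groups and build the legend by hand
for name, part in df.groupby("group"):
    ax_mpl.scatter(part["x"], part["y"], label=name)
ax_mpl.set_xlabel("x")
ax_mpl.set_ylabel("y")
ax_mpl.legend(title="group")
ax_mpl.set_title("Matplotlib")

# Seaborn: grouping, colors, and legend all come from the hue argument
sns.scatterplot(data=df, x="x", y="y", hue="group", ax=ax_sns)
ax_sns.set_title("Seaborn")

plt.tight_layout()
plt.show()
```

Because Seaborn draws onto Matplotlib axes, the two libraries can be mixed: a Seaborn plot can still be fine-tuned with Matplotlib calls such as `ax_sns.set_title()`.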

## 1. Line Plot

Line plots are used to visualize data with a **continuous** or **sequential** variable, such as time series or other naturally ordered data, and to show trends, patterns, or changes over time or across a progression of values. Connecting the data points with lines helps **illustrate the relationship or progression between them**.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')

# Sample data
x = [1, 2, 3, 4, 5]  # X-axis values
y = [2, 4, 6, 8, 10]  # Y-axis values

# Create a line plot
plt.plot(x, y)

# Add labels and a title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Simple Line Plot')

# Display the plot
plt.show()

## 2. Scatter Plot

Scatter plots are used to visualize and analyze the relationships or correlations between **two continuous variables**, making them ideal for **exploring patterns**, **trends**, **clusters**, or **outliers** in data. They are particularly useful for **identifying associations**, such as positive or negative correlations, and assessing the **dispersion** or **clustering** of data points.

In [None]:
# Generate data for x and y coordinates
x = np.arange(80)
y = np.arange(80) + 6 * np.random.randn(80) 

# Create a scatter plot
plt.scatter(x, y, s=30, alpha=0.5, color='blue')

# Add labels and a title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot')

# Display the plot
plt.show()
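
The positive or negative correlation that a scatter plot suggests visually can also be checked numerically. The following is a minimal sketch that rebuilds data the same way as the cell above (with a fixed seed, which is an addition for reproducibility) and computes the Pearson correlation coefficient.

```python
import numpy as np

# Same construction as the scatter plot data, but with a fixed seed
rng = np.random.default_rng(42)
x = np.arange(80)
y = x + 6 * rng.standard_normal(80)

# Pearson correlation coefficient:
# +1 is a perfect positive linear relationship, -1 a perfect negative one,
# and values near 0 indicate no linear relationship
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.3f}")
```

A value close to +1 here matches the upward-sloping cloud of points in the plot.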

## 3. Bar Chart

Bar charts are used to **represent** and **compare** discrete categories or groups of data, making them suitable for visualizing categorical data and showing the relative **sizes**, **frequencies**, or **counts** of different categories. They are particularly effective for displaying data that is not continuous and for making comparisons between distinct categories or groups.

In [None]:
# Sample data
categories = ['Category A', 'Category B', 'Category C', 'Category D']
values = [25, 40, 30, 50]

# Create a bar chart
plt.bar(categories, values, color='skyblue')

# Add labels and a title
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Chart')

# Display the plot
plt.show()

## 4. Histogram Plot

Histograms are used to visualize the **distribution of continuous or numerical data**, allowing us to understand the **frequency and pattern of values** within specific intervals (bins). They are particularly useful for **exploring the shape**, **central tendency**, and **spread of data**, as well as identifying potential outliers or modes in the distribution.

**Exploring the Shape:** This refers to understanding the overall pattern or form of a data distribution, such as whether it is symmetric, skewed, bimodal (having two peaks), or uniform.

**Central Tendency:** It involves examining where the data tends to cluster or concentrate, typically described by measures like the mean (average), median (middle value), or mode (most frequent value).

**Spread of Data:** This relates to how data points are dispersed or spread out across the range, which can be assessed using measures like standard deviation, range, or interquartile range. It helps in understanding the variability or consistency of the data.
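
The three ideas above can also be expressed as numbers that complement what the histogram shows. This is a small sketch using NumPy only; the seed and the variable names are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(1000)

# Central tendency
mean = data.mean()
median = np.median(data)

# Spread
std = data.std(ddof=1)                   # sample standard deviation
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                            # interquartile range

# Shape: moment-based skewness (about 0 for symmetric data,
# positive for a long right tail, negative for a long left tail)
skew = np.mean((data - mean) ** 3) / data.std() ** 3

print(f"mean={mean:.2f}, median={median:.2f}")
print(f"std={std:.2f}, IQR={iqr:.2f}")
print(f"skewness={skew:.2f}")
```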

In [None]:
# Generate some random data for the histogram
data = np.random.randn(1000)  # Replace this with your own data

# Create a histogram
plt.hist(data, bins=20, color='blue', alpha=0.6)

# Add labels and a title
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram Plot')

# Display the plot
plt.show()

The **bins** parameter of **plt.hist()** controls how the data range is divided into **intervals** for the histogram. Passing an integer, as in the example above, sets the number of equal-width bins.

Here's how the bins parameter works and why it's important:

**Number of Bins:** When you set the bins parameter, you are dividing the range of your data into discrete intervals or bins. Each bin represents a specific range of values.

**Data Aggregation:** The histogram counts how many data points fall into each bin. A larger number of bins will result in smaller, more detailed intervals, whereas a smaller number of bins will result in larger, broader intervals.

**Visualization:** The choice of the number of bins can significantly affect the appearance of the histogram. Too few bins may oversimplify the data distribution, while too many bins may make the plot noisy and harder to interpret.

**Balancing Act:** Selecting the right number of bins is often a balance between capturing the underlying data distribution and making the plot visually informative. There are various rules of thumb and mathematical methods for selecting an optimal number of bins.

In [None]:
# Generate some random data for the histogram
data = np.random.randn(1000)

# Create a histogram with different numbers of bins
plt.figure(figsize=(12, 4))

plt.subplot(131)
plt.hist(data, bins=10, color='blue', alpha=0.7)
plt.title('10 Bins')

plt.subplot(132)
plt.hist(data, bins=20, color='green', alpha=0.7)
plt.title('20 Bins')

plt.subplot(133)
plt.hist(data, bins=50, color='red', alpha=0.7)
plt.title('50 Bins')

plt.tight_layout()
plt.show()


In the example above, we create three histograms with different numbers of bins (10, 20, and 50) for the same dataset. As you can see, the number of bins affects the granularity of the histogram and how it represents the data distribution.
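
Some of the rules of thumb mentioned earlier are built into NumPy. As a sketch, `np.histogram_bin_edges` accepts a rule name such as `"sturges"`, `"fd"` (Freedman-Diaconis), or `"auto"` and returns the bin edges that rule would choose. The seed and the `bin_counts` dictionary are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(1000)

# Ask NumPy how many bins each rule would pick for this dataset
bin_counts = {}
for rule in ["sturges", "fd", "auto"]:
    edges = np.histogram_bin_edges(data, bins=rule)
    bin_counts[rule] = len(edges) - 1  # n edges define n - 1 bins
    print(f"{rule:>8}: {bin_counts[rule]} bins")
```

The same rule names can be passed directly to the plotting call, for example `plt.hist(data, bins='auto')`.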

### Difference between Bar chart and Histogram

1. **Data Type**:
   - **Bar Chart**: Bar charts represent discrete or categorical data where each bar typically corresponds to a distinct category or group.
   - **Histogram**: Histograms visualize continuous data and display the frequency distribution of values within specified bins or intervals.

2. **X-Axis**:
   - **Bar Chart**: The X-axis in a bar chart represents discrete categories or labels.
   - **Histogram**: The X-axis in a histogram represents the range of values divided into bins or intervals.

3. **Use Cases**:
   - **Bar Chart**: Used for comparing categories or groups, displaying non-continuous data, and showing relative sizes or counts among distinct items.
   - **Histogram**: Utilized for visualizing the distribution of continuous data, highlighting patterns, skewness, and central tendencies.

4. **Spacing**:
   - **Bar Chart**: Typically, there is space between the bars to emphasize the distinction between categories.
   - **Histogram**: Bars in a histogram are contiguous, representing continuous ranges of values.

5. **Data Transformation**:
   - **Bar Chart**: The data for each bar is usually not transformed; it's directly represented as distinct values.
   - **Histogram**: Data is transformed into frequency counts within bins, emphasizing the distribution of values rather than individual data points.
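
The data transformation point can be made concrete with a short sketch: a bar chart plots the values it is given, while a histogram first counts raw observations per bin. Here `np.histogram` performs that counting step explicitly, and the counts are then drawn with `bar()`. The sample data is made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Bar chart input: one value per category, plotted as-is
categories = ['Category A', 'Category B', 'Category C']
values = [25, 40, 30]

# Histogram input: raw observations that are counted per bin first
rng = np.random.default_rng(0)
data = rng.standard_normal(200)
counts, edges = np.histogram(data, bins=10)

fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Bars are separated to emphasize distinct categories
ax_bar.bar(categories, values, color='skyblue')
ax_bar.set_title('Bar Chart: values plotted directly')

# Bars touch because each one covers a continuous interval
ax_hist.bar(edges[:-1], counts, width=np.diff(edges), align='edge',
            color='blue', alpha=0.6, edgecolor='black')
ax_hist.set_title('Histogram: counts per bin')

plt.tight_layout()
plt.show()
```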

## 5. Pie Chart

Pie charts are used to **represent and compare parts of a whole**, making them suitable for visualizing data where individual categories or components contribute to a **total** or **percentage** composition, such as market share, budget allocation, or demographic distribution. They are effective for displaying data with a limited number of categories or segments.

In [None]:
# Sample data
labels = ['Category A', 'Category B', 'Category C', 'Category D']
sizes = [15, 30, 45, 10]  # Percentages (summing up to 100%)

# Create a pie chart
# The autopct parameter adds percentage labels to each wedge
plt.pie(sizes, labels=labels, autopct='%1.1f%%', colors=['skyblue', 'lightgreen', 'lightcoral', 'lightsalmon'])

# Add a title
plt.title('Pie Chart')

# Display the plot
plt.show()
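
The values passed to `plt.pie()` do not have to be percentages: each wedge is sized by its share of the total. The sketch below uses hypothetical raw counts that work out to the same shares as the example above, and computes those shares by hand for comparison.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical raw counts (not percentages)
labels = ['Category A', 'Category B', 'Category C', 'Category D']
counts = np.array([30, 60, 90, 20])

# Share of the whole for each category, in percent
shares = 100 * counts / counts.sum()
for label, share in zip(labels, shares):
    print(f"{label}: {share:.1f}%")

# pie() sizes each wedge by value / sum(values)
fig, ax = plt.subplots()
ax.pie(counts, labels=labels, autopct='%1.1f%%')
ax.set_title('Pie Chart from Raw Counts')
plt.show()
```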

## 6. Box Plot

Box plots are used to visualize and compare the **distribution**, **central tendency**, and **spread** of numerical data, making them suitable for identifying outliers, assessing the skewness of data, and comparing the characteristics of different datasets or categories. They are particularly **effective** when analyzing data **with multiple groups** or variables to understand their statistical properties.

In [None]:
# Sample data for two datasets
data1 = [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
data2 = [5, 15, 25, 30, 35, 40, 45, 50, 60, 70]

# Create a figure and axis
fig, ax = plt.subplots()

# Create box plots for the two datasets
box_plot_data = [data1, data2]
box_plot_labels = ['Dataset 1', 'Dataset 2']

# Customize box plots
boxprops = dict(linewidth=2, color='darkblue')
medianprops = dict(linewidth=2, color='red')
whiskerprops = dict(linewidth=2, color='green')
capprops = dict(linewidth=2, color='purple')

# Create the box plots
bplot = ax.boxplot(box_plot_data,
                   vert=True,
                   patch_artist=True,
                   labels=box_plot_labels,  # renamed to tick_labels in Matplotlib 3.9
                   boxprops=boxprops,
                   medianprops=medianprops,
                   whiskerprops=whiskerprops,
                   capprops=capprops)

# Add labels and a title
ax.set_xlabel('Datasets')
ax.set_ylabel('Values')
ax.set_title('Box Plot Example')

# Customize the colors of the boxes
colors = ['lightblue', 'lightgreen']
for patch, color in zip(bplot['boxes'], colors):
    patch.set_facecolor(color)

# Add a legend
ax.legend(bplot["boxes"], box_plot_labels)

# Add a grid and show the plot
ax.grid(True)
plt.show()

In the provided Python code for creating box plots, **boxprops**, **medianprops**, **whiskerprops**, and **capprops** are dictionaries used to specify properties for different parts of the box plot:

**boxprops**: This dictionary defines properties for the box of the box plot. The properties specified in boxprops are applied to the rectangular box that represents the interquartile range (IQR) of the data. In the code example, linewidth=2 and color='darkblue' are set for boxprops, which means that the box lines will have a width of 2 and be colored dark blue.

**medianprops**: This dictionary defines properties for the median line of the box plot. The median is the line inside the box that represents the median value of the data. In the code, linewidth=2 and color='red' are set for medianprops, indicating that the median line will have a width of 2 and be colored red.

**whiskerprops**: This dictionary defines properties for the whiskers of the box plot. By default, each whisker extends from the box to the most extreme data point that lies within 1.5 times the IQR of the box edge; points beyond the whiskers are drawn as outliers. In the code, linewidth=2 and color='green' are set for whiskerprops, which means that the whisker lines will have a width of 2 and be colored green.

**capprops**: This dictionary defines properties for the caps at the end of the whiskers. The caps are small horizontal lines at the end of the whiskers. In the code, linewidth=2 and color='purple' are set for capprops, indicating that the caps will have a width of 2 and be colored purple.

*By customizing these properties, you can control the appearance of different parts of the box plot to make it look the way you want. You can adjust the line widths, colors, and other visual aspects of the box plot to suit your preferences and the requirements of your data visualization.*
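
To connect the parts of the box plot to actual numbers, the sketch below computes the quartiles, the IQR, and the whisker positions for `data1` from the example. It assumes Matplotlib's default whisker rule (`whis=1.5`), which the example does not override; the variable names are chosen for this illustration.

```python
import numpy as np

data1 = np.array([10, 15, 20, 25, 30, 35, 40, 45, 50, 55])

# The box spans Q1 to Q3, with the median line inside it
q1, median, q3 = np.percentile(data1, [25, 50, 75])
iqr = q3 - q1

# Points outside these limits are treated as outliers (default whis=1.5)
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the limits
inside = data1[(data1 >= lower_limit) & (data1 <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = data1[(data1 < lower_limit) | (data1 > upper_limit)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"Whiskers: {whisker_low} to {whisker_high}")
print(f"Outliers: {outliers}")
```

For this dataset no point falls outside the limits, which is why the box plot for Dataset 1 shows no outlier markers.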

## 7. Pair Plot

A pair plot is a grid of plots that shows the **pairwise relationships between the variables** in a dataset. By default in Seaborn, each pair of numeric variables gets a scatter plot, and the diagonal shows the distribution of each individual variable. It is a useful tool for **EDA** (Exploratory Data Analysis) and can help us identify **patterns**, **trends**, and **outliers** in our data.

Pair plots are typically used:

* To identify the **relationship** between two variables, such as the relationship between height and weight.
* To identify **clusters** of data, such as groups of customers with similar spending habits.
* To identify **outliers**, such as data points that are significantly different from the rest of the data.

In [None]:
# Load the iris dataset
iris = sns.load_dataset("iris")

# Create a pair plot of the iris dataset
sns.pairplot(iris)

# Show the plot
plt.show()

## 8. Heat Map

A heat map is a graphical representation of data in which values are portrayed by color, with the color varying in hue or intensity. Heat maps can be used to visualize a variety of data, including:

* **Numerical data**: Heat maps can be used to visualize the distribution of numerical data. For example, a heat map could be used to show the distribution of income in a population.
* **Categorical data**: Heat maps can also be used to visualize categorical data. For example, a heat map could be used to show the distribution of political party affiliation in a population.
* **Geographic data**: Heat maps can also be used to visualize geographic data. For example, a heat map could be used to show the population density of a country.

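*In a heat map, the color of each cell represents the value of the data at that location. How colors map to values depends on the colormap: with many sequential colormaps, darker colors represent higher values and lighter colors lower values, while a diverging colormap such as "coolwarm" runs from blue for low values to red for high values.*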


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Generate random data
data = np.random.rand(10, 10)

# Create a heatmap
sns.heatmap(data, cmap="coolwarm", annot=True)

# Customize the plot
plt.title("Heatmap of Random Data")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
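
Since the meaning of a color depends on the colormap, it can help to see the same data drawn with two different ones. This is a minimal sketch that places a sequential colormap (`"Blues"`) next to a diverging one (`"coolwarm"`); the 5x5 random matrix and the fixed color limits are choices made for this example.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

rng = np.random.default_rng(0)
data = rng.random((5, 5))  # values between 0 and 1

fig, (ax_seq, ax_div) = plt.subplots(1, 2, figsize=(11, 4))

# Sequential colormap: light colors for low values, dark colors for high values
sns.heatmap(data, cmap="Blues", vmin=0, vmax=1, annot=True, fmt=".2f", ax=ax_seq)
ax_seq.set_title("Sequential colormap (Blues)")

# Diverging colormap: blue for low values, red for high values
sns.heatmap(data, cmap="coolwarm", vmin=0, vmax=1, annot=True, fmt=".2f", ax=ax_div)
ax_div.set_title("Diverging colormap (coolwarm)")

plt.tight_layout()
plt.show()
```

Fixing `vmin` and `vmax` keeps the color scale identical in both panels, so the only difference between them is the colormap itself.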