### 1/What are common preprocessing steps? Explain for each step why and when you should execute this step and when not.

Each step below is listed with its purpose and the conditions under which it applies:

1. **Data Cleaning**: Handle missing data, outliers, and inconsistent values so that downstream results are accurate and reliable. Run this step when the dataset contains missing values, outliers, or inconsistencies that could distort the analysis. Skip it when the data is already complete and validated, and do not remove outliers that are genuine observations of interest.

2. **Data Transformation**: Normalize or scale numeric variables and encode categorical variables. Run this step when variables have different units or ranges, since distance-based and gradient-based methods are sensitive to scale, or when the algorithm requires numeric input. Scaling can be skipped for methods that are insensitive to monotonic rescaling, such as tree-based models.

3. **Handling Imbalanced Data**: Address class imbalance in datasets. Execute this step if the distribution of classes or categories is uneven and may introduce bias. Skip this step if the dataset is balanced or the class imbalance is not relevant to the analysis.

4. **Feature Selection/Extraction**: Identify relevant features or create new ones. Execute these steps to improve model performance, reduce complexity, and focus on informative features. Perform feature selection/extraction when dealing with high-dimensional data or when certain features are irrelevant or redundant.

5. **Data Formatting**: Ensure consistent formatting across the dataset. Execute this step to ensure uniformity and compatibility. Perform data formatting when inconsistencies in data types, date formats, or units of measurement may cause errors or misinterpretation.

6. **Dimensionality Reduction**: Reduce the number of features while retaining the essential structure. Run this step on high-dimensional data to aid visualization, reduce noise, and highlight patterns. Skip it when the dataset has few features or when the original features must stay interpretable.

The decision to execute each preprocessing step depends on the specific characteristics of the dataset and the analysis goals. Consider the quality, cleanliness, and suitability of the data, and skip a step if it's not relevant or necessary for the analysis at hand.
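As an illustration of steps 1 and 2, the following is a minimal sketch using scikit-learn and pandas. The toy table, its column names, and the choice of median imputation are assumptions made for this example and do not come from any tutorial; other strategies may suit real data better.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns on very different scales
# (each with one missing value) and one categorical column.
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0],
    "income": [30_000.0, 52_000.0, 47_000.0, np.nan],
    "city": ["Utrecht", "Groningen", "Groningen", "Utrecht"],
})

# Data cleaning + transformation for numeric columns:
# fill missing values with the column median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Transformation for categorical columns: one-hot encoding.
categorical = OneHotEncoder(handle_unknown="ignore")

# sparse_threshold=0 forces a dense NumPy array as output.
preprocess = ColumnTransformer(
    [
        ("num", numeric, ["age", "income"]),
        ("cat", categorical, ["city"]),
    ],
    sparse_threshold=0,
)

X = preprocess.fit_transform(df)
print(X.shape)  # two scaled numeric columns plus one column per city
```

Wrapping the steps in a pipeline means the imputation and scaling statistics are learned in `fit` and reused in `transform`, which helps avoid leaking information from held-out data when the same preprocessing is applied to a test split.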

### 2/What visualization methods are used in the cluster methods tutorial? Explain why the selected method is the most appropriate method for the visualization. Bonus points: do this as well for the scanpy tutorial.

1. **Scatter Plot**: Scatter plots show each data point as a dot in a two-dimensional space, with the two axes corresponding to two features or two derived dimensions. They are useful for visualizing the separation, density, and overlap of clusters, which gives an intuitive picture of the clustering structure.

2. **Heatmap**: Heatmaps represent a matrix of values, where each cell's color indicates the magnitude of the value. They are often used to show the expression levels of genes or variables across clusters. They give a condensed view of cluster-specific patterns and help identify differential expression or other patterns of interest within clusters.

3. **t-SNE Plot**: t-SNE (t-distributed stochastic neighbor embedding) is a dimensionality reduction technique designed for visualization. It reduces high-dimensional data to two or three dimensions while preserving local neighborhoods. t-SNE plots are effective for revealing clusters and subclusters because nearby points in the plot are similar in the original space. Distances between clusters and the apparent sizes of clusters in a t-SNE plot are not reliable and should not be over-interpreted.

4. **UMAP Plot**: UMAP (Uniform Manifold Approximation and Projection) is another dimensionality reduction technique commonly used for visualizing high-dimensional data. It preserves local structure and tends to retain more of the global structure than t-SNE. UMAP plots are suitable for visualizing clusters because they show both the clusters themselves and, to some extent, how the clusters relate to each other, although the embedding still depends on parameter choices and is an approximation of the original data.

These visualization methods help in understanding and interpreting the clusters in the data. They provide insights into the spatial distribution of data points, patterns of gene expression, and relationships between clusters. By using a combination of scatter plots, heatmaps, t-SNE plots, and UMAP plots, analysts can gain a comprehensive understanding of the clustering structure and underlying patterns within the data.
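The embedding-based plots described above can be sketched with scikit-learn as follows. This is a sketch on synthetic data, not the tutorial's code: the dataset, the number of PCA components, and the perplexity value are assumptions chosen for illustration, and `plot_embedding` is a hypothetical helper. UMAP is omitted because it needs a separate package, but it would slot in the same way as t-SNE.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic stand-in for real data: 150 points, 20 features, 3 clusters.
X, labels = make_blobs(n_samples=150, n_features=20, centers=3, random_state=0)

# PCA is linear and deterministic; it is often run first to reduce noise.
X_pca = PCA(n_components=10, random_state=0).fit_transform(X)

# t-SNE is stochastic, so the seed is fixed; perplexity must be smaller
# than the number of samples.
X_tsne = TSNE(
    n_components=2, perplexity=30, init="pca", random_state=0
).fit_transform(X_pca)


def plot_embedding(embedding, labels, title):
    """Scatter plot of a 2-D embedding, colored by cluster label."""
    # Imported lazily so the embedding code above runs without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=10, cmap="tab10")
    ax.set_title(title)
    ax.set_xlabel("dimension 1")
    ax.set_ylabel("dimension 2")
    return fig
```

Calling `plot_embedding(X_tsne, labels, "t-SNE")` and `plot_embedding(X_pca[:, :2], labels, "PCA")` side by side is a simple way to compare how a nonlinear and a linear projection present the same clusters.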

### 3/What performance/evaluation metrics are in the cluster methods tutorial? Explain why the used methods are the most appropriate method for the evaluation.

In the cluster methods tutorial, common performance and evaluation metrics used for cluster analysis include:

1. **Silhouette Score**: The Silhouette Score measures how well each data point fits within its assigned cluster compared to the nearest other cluster. For a point with mean intra-cluster distance a and mean distance b to the points of the nearest other cluster, the score is (b - a) / max(a, b), and the overall score is the mean over all points. It ranges from -1 to 1, with higher values indicating more compact and better-separated clusters, which makes it a suitable quantitative measure of both properties.

2. **Calinski-Harabasz Index**: The Calinski-Harabasz Index evaluates the clustering quality based on the ratio of the between-cluster dispersion to the within-cluster dispersion. A higher index value indicates better-defined and well-separated clusters. This index is appropriate for evaluation as it captures the cluster compactness and separation.

3. **Davies-Bouldin Index**: The Davies-Bouldin Index measures the average similarity between each cluster and the cluster most similar to it. Similarity is the ratio of the combined within-cluster scatter of the two clusters (the average distance of points to their own cluster center) to the distance between the two cluster centers. A lower index value indicates better-defined and less overlapping clusters. The Davies-Bouldin Index is suitable for evaluation as it assesses both the intra-cluster and inter-cluster distances.

These evaluation metrics are commonly used because they quantify cluster separation, compactness, and overlap, which makes clustering results comparable on an objective basis. All three tend to favor compact, convex clusters, so they can understate the quality of density-based or irregularly shaped clusterings.

The Silhouette Score, Calinski-Harabasz Index, and Davies-Bouldin Index are widely used because they do not require ground truth labels for evaluation. This makes them suitable for unsupervised clustering tasks, where the true cluster assignments are unknown. They provide a means to compare different clustering algorithms or parameter settings based on their ability to produce well-defined and separated clusters.

In practice, analysts compute these metrics for each candidate algorithm or parameter setting and prefer the configuration on which the metrics agree. When the metrics disagree, inspecting the clusters visually helps decide between the candidates.
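A comparison across parameter settings could look like the sketch below, which uses scikit-learn's implementations of the three metrics. The synthetic dataset, the use of k-means, and the range of k values are assumptions for illustration and are not taken from the tutorial.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic data: four well-separated clusters at hand-picked centers.
centers = [(-5, -5), (-5, 5), (5, -5), (5, 5)]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.6, random_state=42)

# Score k-means clusterings for several candidate numbers of clusters.
# None of the metrics needs ground-truth labels.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = {
        "silhouette": silhouette_score(X, labels),                # higher is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
    }

# Pick the k with the highest silhouette score.
best_k = max(scores, key=lambda k: scores[k]["silhouette"])

for k, s in scores.items():
    print(k, {name: round(value, 3) for name, value in s.items()})
print("best k by silhouette:", best_k)
```

On clearly separated toy data like this, the silhouette score should peak at the true number of clusters. On real data the three metrics can point to different settings, which is a reason to report them together rather than rely on one.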