

## Common preprocessing steps:
### a. Quality control (Inspecting and cleaning the data):
This part typically includes reading the file, inspecting various aspects of the data such as its size/shape, presence of NaN values, and understanding the type of data. After inspecting the data, the next step is to clean it, which involves handling missing values, removing duplicates, or addressing any other data quality issues that may be present.
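
A minimal sketch of these inspection and cleaning steps with pandas is shown here. The small table `raw` and its column names are hypothetical stand-ins for a real dataset, and filling gaps with the column median is only one of several reasonable strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for a real dataset
raw = pd.DataFrame({
    "radius": [14.2, 20.1, np.nan, 14.2, 11.8],
    "texture": [19.5, 17.3, 21.0, 19.5, np.nan],
    "label": ["B", "M", "B", "B", "M"],
})

# Inspect: size/shape, column types, and number of NaN values per column
print(raw.shape)
print(raw.dtypes)
print(raw.isna().sum())

# Clean: drop exact duplicate rows first
clean = raw.drop_duplicates()

# Then fill missing numeric values with the median of each column
numeric_cols = clean.select_dtypes(include="number").columns
clean = clean.fillna({col: clean[col].median() for col in numeric_cols})
print(clean)
```

Whether to fill, drop, or flag missing values depends on how much data is missing and why, so the median fill above should be read as an illustration rather than a recommendation.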

### b. Normalizing and Scaling:
Normalizing and scaling is a common preprocessing step that adjusts the data to a standard range or distribution. It addresses differences in magnitude or scale between features or variables, which is particularly important when comparing or combining variables that have different units or ranges of values.
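
As a rough illustration, the sketch below applies two common scikit-learn scalers to a made-up array `X` whose two columns differ in scale by a factor of 1000. `StandardScaler` centers each feature to mean 0 and unit variance, while `MinMaxScaler` maps each feature to the range [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: two features on very different scales
X = np.array([
    [1.0, 1000.0],
    [2.0, 2000.0],
    [3.0, 3000.0],
    [4.0, 4000.0],
])

# Standardization: each column gets mean 0 and standard deviation 1
X_standard = StandardScaler().fit_transform(X)

# Min-max scaling: each column is mapped to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

print(X_standard)
print(X_minmax)
```

After either transformation both columns contribute comparably to distance-based methods such as k-means.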

### c. Feature selection:

Feature selection involves identifying and selecting a subset of relevant features from the dataset. This step is beneficial when dealing with high-dimensional data, as it reduces noise and computational complexity. By keeping only the most informative features, the analysis focuses on the essential aspects of the data, which tends to produce cleaner clusters.
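
One simple unsupervised option, shown here only as a sketch, is to drop features that carry no variation using scikit-learn's `VarianceThreshold`. The synthetic array `X` is an assumption made for illustration; other selection strategies (for example, choosing columns from domain knowledge) are equally valid.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)

# Hypothetical data: two varying features and one constant feature
X = np.column_stack([
    rng.normal(size=100),
    rng.normal(size=100) * 5,
    np.full(100, 3.0),  # constant column, carries no information
])

# threshold=0.0 removes only features with zero variance
selector = VarianceThreshold(threshold=0.0)
X_selected = selector.fit_transform(X)

# Boolean mask of the columns that were kept
kept = selector.get_support()
print(kept)
print(X_selected.shape)
```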

### d. Dimensionality reduction:
Dimensionality reduction techniques, such as principal component analysis (PCA) or t-distributed stochastic neighbor embedding (t-SNE), are used to reduce the dimensionality of the data while retaining its key structure. This step helps visualize and analyze the data effectively by reducing noise and highlighting underlying patterns.
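
The following sketch shows PCA on synthetic data and reports how much variance each component retains. The array `X` is generated here purely for illustration (six features driven by two hidden factors); it is not taken from any of the tutorials.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 6 observed features driven by 2 hidden factors plus noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

# Scale first so that no single feature dominates the components
X_scaled = StandardScaler().fit_transform(X)

# Project onto the first two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Fraction of the total variance captured by each component
ratios = pca.explained_variance_ratio_
print(ratios, ratios.sum())
```

Checking `explained_variance_ratio_` is a quick way to judge whether two components are enough before plotting or clustering on them.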

These preprocessing steps should be executed to ensure the data is of high quality, comparable, and suitable for downstream analysis. However, the specific execution may vary depending on the dataset and analysis goals. For example, if the dataset is already preprocessed and normalized, one may skip those steps and proceed directly to dimensionality reduction and clustering analysis.




# Tutorial_Clustering_Methods

## Visualization methods used in the clustering methods tutorial
The visualization method used is a dendrogram, which draws the merge tree produced by hierarchical clustering. Hierarchical clustering groups similar data points into clusters based on a measure of similarity or dissimilarity, and the dendrogram is appropriate here because it shows which clusters are most similar to each other and how they are connected. The dendrogram can handle large datasets by truncating or collapsing parts of the tree to make the visualization more manageable. In the provided code, the `truncate_mode='lastp'` parameter truncates the dendrogram so that only the top of the tree is shown (50 clusters in this case), with everything below collapsed into leaf nodes.
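
A small sketch of a truncated dendrogram with SciPy is given here. The synthetic points, the Ward linkage, and `p=10` are assumptions chosen to keep the example short; the tutorial itself may use different data and settings.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)

# Hypothetical data: two well-separated groups of 100 points each
X = np.vstack([
    rng.normal(loc=0.0, size=(100, 2)),
    rng.normal(loc=6.0, size=(100, 2)),
])

# Build the hierarchical merge tree (Ward linkage is an assumption here)
Z = linkage(X, method="ward")

# Draw only the top of the tree: earlier merges are collapsed into p leaves
fig, ax = plt.subplots(figsize=(8, 4))
result = dendrogram(Z, truncate_mode="lastp", p=10, show_contracted=True, ax=ax)
ax.set_xlabel("Cluster size (in parentheses) or sample index")
ax.set_ylabel("Merge distance")
fig.tight_layout()
```

Labels in parentheses on the x-axis give the number of original samples collapsed into that leaf.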


# Tutorial_cluster_scanpy_object
## Performance/evaluation metrics:

ROC-AUC is a commonly used metric for evaluating binary classification models. The ROC-AUC score ranges from 0 to 1, where 0.5 corresponds to random guessing and higher values indicate better model performance. In this case, the `roc_auc_score` function from scikit-learn is used to calculate the ROC-AUC score for each iteration of the loop.

ROC-AUC was chosen for this case because the target variable `y` is binary.
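
For reference, a minimal call to `roc_auc_score` looks like the sketch below. The labels and scores are made-up values, not results from the tutorial.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical binary labels and predicted scores for the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Fraction of (positive, negative) pairs ranked in the correct order
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```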

------


Fatemeh and I collaborated on this assignment.

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Read the breast cancer dataset
data = pd.read_csv(r'../Data/breast-cancer.csv')
data.head()

In [None]:

# Select the relevant features for clustering analysis
selected_features = data[['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean']]
selected_features.head()

In [None]:
# Normalize or scale the features using StandardScaler
scaler = StandardScaler()

# Fit the scaler to the features selected in the previous cell and transform them
scaled_data = scaler.fit_transform(selected_features)

# Perform PCA to reduce the dimensionality of the data
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(scaled_data)
pd.DataFrame(reduced_data, columns=['PC1', 'PC2'])

In [None]:
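# Perform clustering analysis on the preprocessed data
# random_state is fixed so the cluster assignment is reproducible across runs
clustering_algorithm = KMeans(n_clusters=2, n_init=10, random_state=42)  # Choose the number of clusters
clusters = clustering_algorithm.fit_predict(reduced_data)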

# Visualize the clusters in the space of the first two principal components
plt.scatter(reduced_data[clusters == 0, 0], reduced_data[clusters == 0, 1], c='blue', label='Cluster 1')
plt.scatter(reduced_data[clusters == 1, 0], reduced_data[clusters == 1, 1], c='red', label='Cluster 2')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Clustering Results')
plt.legend()
plt.show()