# Design, Implement and Analyse Bioinformatics with an Example

## Import Libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


## Load data

In [2]:
# Load the dataset (here 'human.csv'; replace it with the path to your own file)
df = pd.read_csv('human.csv')

# Explore the dataset
print(df.head())


                                            sequence  class
0  ATGCCCCAACTAAATACTACCGTATGGCCCACCATAATTACCCCCA...      4
1  ATGAACGAAAATCTGTTCGCTTCATTCATTGCCCCCACAATCCTAG...      4
2  ATGTGTGGCATTTGGGCGCTGTTTGGCAGTGATGATTGCCTTTCTG...      3
3  ATGTGTGGCATTTGGGCGCTGTTTGGCAGTGATGATTGCCTTTCTG...      3
4  ATGCAACAGCATTTTGAATTTGAATACCAGACCAAAGTGGATGGTG...      3


## Select Features

In [3]:
# Separate features and labels
X = df['sequence']
y = df['class']


## Split data

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
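
The split above is random and unstratified, so the class proportions in the test set can drift from those in the full data, and the results change on every run. A minimal sketch of a stratified, reproducible split follows. The toy DataFrame and the `random_state` value are assumptions for illustration only, not the contents of `human.csv`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data (hypothetical values, not taken from human.csv)
toy = pd.DataFrame({
    "sequence": ["ATGC", "ATGG", "CCGA", "CCGT", "GGTA",
                 "GGTC", "TTAA", "TTAC", "ACGT", "ACGA"],
    "class": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
})

# stratify keeps the 6:4 class ratio in both halves;
# random_state makes the split repeatable across runs
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["sequence"],
    toy["class"],
    test_size=0.5,
    stratify=toy["class"],
    random_state=42,
)

print(y_tr.value_counts().sort_index())
print(y_te.value_counts().sort_index())
```

Both halves keep three samples of class 0 and two of class 1, which mirrors the ratio in the toy data.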

## Prepare a model

In [5]:
# Convert DNA sequences into numerical features using CountVectorizer.
# Note: with default settings, each unbroken sequence is treated as a single token.
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Create and train a Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train_vectorized, y_train)
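
With its default word-level analyzer, `CountVectorizer` treats an unbroken DNA string as one token, so two sequences share a feature only if they are identical. A common alternative is to count overlapping k-mers (character n-grams). The sketch below compares both settings on two toy sequences. The sequences and the choice of k = 4 are assumptions for illustration, not values taken from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy sequences (hypothetical, for illustration only)
toy_sequences = ["ATGCATGC", "ATGCCCGA"]

# Default settings: each whole sequence becomes one lowercase token
default_vec = CountVectorizer()
default_vec.fit(toy_sequences)
default_vocab = sorted(default_vec.vocabulary_)
print(default_vocab)

# Character 4-grams: overlapping k-mers shared across sequences
kmer_vec = CountVectorizer(analyzer="char", ngram_range=(4, 4), lowercase=False)
kmer_counts = kmer_vec.fit_transform(toy_sequences)
kmer_vocab = sorted(kmer_vec.vocabulary_)
print(kmer_vocab)
print(kmer_counts.toarray())
```

In the k-mer version the two sequences share the feature `ATGC`, which gives a classifier such as `MultinomialNB` something to generalize from. The best value of k depends on the data and would need to be tuned.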

## Predict and Evaluate

In [6]:
# Predict on the test set
y_pred = model.predict(X_test_vectorized)

# Evaluate the model
print('Accuracy:', accuracy_score(y_test, y_pred))
print('\nConfusion Matrix:')
print(confusion_matrix(y_test, y_pred))
print('\nClassification Report:')
print(classification_report(y_test, y_pred, zero_division=1))


Accuracy: 0.4223744292237443

Confusion Matrix:
[[  9   0   0   0   0   0  93]
 [  0  19   0   0   0   0  93]
 [  0   0   4   0   0   0  56]
 [  0   0   0  29   0   0 101]
 [  0   0   0   0  35   0 110]
 [  0   0   0   0   0   0  53]
 [  0   0   0   0   0   0 274]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.09      0.16       102
           1       1.00      0.17      0.29       112
           2       1.00      0.07      0.12        60
           3       1.00      0.22      0.36       130
           4       1.00      0.24      0.39       145
           5       1.00      0.00      0.00        53
           6       0.35      1.00      0.52       274

    accuracy                           0.42       876
   macro avg       0.91      0.26      0.26       876
weighted avg       0.80      0.42      0.35       876



Let's interpret these results:

### Accuracy:
- The overall accuracy is approximately 42.2%: the model correctly classified 370 of the 876 samples in the test set. For comparison, always predicting the largest class (class 6, 274 samples) would score about 31.3%.
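
As a check on the accuracy figure, it can be recomputed directly from the confusion matrix printed above: the diagonal holds the correct predictions. The sketch below also computes a majority-class baseline for comparison. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual, columns: predicted)
cm = np.array([
    [9, 0, 0, 0, 0, 0, 93],
    [0, 19, 0, 0, 0, 0, 93],
    [0, 0, 4, 0, 0, 0, 56],
    [0, 0, 0, 29, 0, 0, 101],
    [0, 0, 0, 0, 35, 0, 110],
    [0, 0, 0, 0, 0, 0, 53],
    [0, 0, 0, 0, 0, 0, 274],
])

correct = np.trace(cm)   # sum of the diagonal
total = cm.sum()         # all test samples
accuracy = correct / total

# Accuracy of a model that always predicts the largest class
majority_baseline = cm.sum(axis=1).max() / total

print(f"accuracy = {accuracy:.4f}, majority baseline = {majority_baseline:.4f}")
```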

### Confusion Matrix:
- The confusion matrix shows actual classes as rows and predicted classes as columns. Every misclassified sample falls in the last column, meaning every error is a sample from classes 0 to 5 predicted as class 6. No sample of class 5 is classified correctly.

### Precision, Recall, and F1-Score:
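- Precision, recall, and F1-score show how the model performs on each individual class.
- Class 6 has perfect recall (1.00) but low precision (0.35): the model finds every class 6 sample, yet only about a third of its class 6 predictions are correct.
- Classes 0 to 4 show the opposite pattern: precision is 1.00, but recall lies between 0.07 and 0.24, so the model misses most instances of those classes.
- The precision of 1.00 reported for class 5 is an artifact of `zero_division=1`: the model never predicts class 5, so its precision is undefined and is filled in with 1.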
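
The per-class figures can be reproduced from the confusion matrix: precision divides each diagonal entry by its column sum (everything predicted as that class), and recall divides it by its row sum (everything that actually belongs to that class). In this sketch an undefined precision is left as NaN rather than replaced by 1, which is a choice made here for illustration and differs from the `zero_division=1` setting used above.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual, columns: predicted)
cm = np.array([
    [9, 0, 0, 0, 0, 0, 93],
    [0, 19, 0, 0, 0, 0, 93],
    [0, 0, 4, 0, 0, 0, 56],
    [0, 0, 0, 29, 0, 0, 101],
    [0, 0, 0, 0, 35, 0, 110],
    [0, 0, 0, 0, 0, 0, 53],
    [0, 0, 0, 0, 0, 0, 274],
])

diag = np.diag(cm)
col_sums = cm.sum(axis=0)  # number of predictions per class
row_sums = cm.sum(axis=1)  # number of actual samples per class

# Leave precision as NaN where a class is never predicted
precision = np.divide(
    diag, col_sums, out=np.full(len(diag), np.nan), where=col_sums > 0
)
recall = diag / row_sums

for label, (p, r) in enumerate(zip(precision, recall)):
    print(f"class {label}: precision={p:.2f} recall={r:.2f}")
```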

### Support:
- Support represents the number of actual occurrences of each class in the test set.
- Support is uneven: class 6 has 274 test samples while class 5 has only 53, so averaged metrics should be read with this imbalance in mind.
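
To put numbers on the imbalance, the support column of the report can be turned into class shares. This is a small sketch using the support values printed above. The variable names are illustrative.

```python
import pandas as pd

# Support values copied from the classification report above
support = pd.Series([102, 112, 60, 130, 145, 53, 274], index=range(7), name="support")

share = support / support.sum()                    # fraction of the test set per class
imbalance_ratio = support.max() / support.min()    # largest class vs. smallest class

print(share.round(3))
print(f"largest / smallest class: {imbalance_ratio:.2f}")
```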

### Macro Avg and Weighted Avg:
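- Macro avg and weighted avg summarize performance across classes: macro avg treats all classes equally, while weighted avg weights each class by its support.
- The gap between macro-averaged precision (0.91) and recall (0.26) shows that most classes are predicted rarely but correctly. The precision figure is further inflated because `zero_division=1` assigns class 5, which is never predicted, a precision of 1.00.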
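
The two averages can be reproduced from the rounded per-class precision and support values in the report. The sketch below also shows how both averages would change if the undefined precision of class 5 were scored as 0 instead of 1. That second calculation is a what-if for illustration, not an output of the notebook.

```python
import numpy as np

# Per-class precision (rounded) and support, copied from the report above
precision = np.array([1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 0.35])
support = np.array([102, 112, 60, 130, 145, 53, 274])

macro_precision = precision.mean()                            # every class counts equally
weighted_precision = np.average(precision, weights=support)   # weighted by class size

# What-if: score the never-predicted class 5 as 0 instead of 1
precision_zero = precision.copy()
precision_zero[5] = 0.0
macro_zero = precision_zero.mean()
weighted_zero = np.average(precision_zero, weights=support)

print(f"macro: {macro_precision:.2f}, weighted: {weighted_precision:.2f}")
print(f"with class 5 scored 0 -> macro: {macro_zero:.2f}, weighted: {weighted_zero:.2f}")
```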

### Interpretation:
- The overall performance of the model is poor, and the gap between precision and recall shows that it defaults to class 6 for most samples.
- Class 6 is always recovered (recall 1.00), but only because the model over-predicts it; every other class is recognized in fewer than a quarter of cases.
- Consider investigating misclassifications, revisiting how sequences are turned into features, exploring different algorithms, tuning hyperparameters, and addressing potential data quality issues to improve the model's performance.

It's essential to carefully examine the context of the bioinformatics problem, understand the significance of each class, and determine whether the model's performance aligns with the application's requirements. Additionally, consider experimenting with more advanced algorithms and techniques, such as deep learning, depending on the nature of the data and the complexity of the problem.