# Submodule 1 - Introduction


## Overview
This submodule begins with a brief introduction to AI/ML, its history, and its diverse applications, and then focuses on some key concepts used in AI/ML. We will compare different AI/ML algorithms and recommend which algorithms to use for different data types in biomedical research. You will learn core concepts, diverse applications, introductory algorithms, and data challenges.

## Learning Objectives
At the end of this submodule, you should be able to:

+ Understand the difference and the relationship between artificial intelligence (AI) and machine learning (ML)
+ Learn key concepts in ML, including learning types and algorithms
+ Become familiar with some applications of AI/ML in the biomedical field

## Prerequisites
* An AWS account with access to Amazon SageMaker
* Basic understanding of Python programming

## Get Started
- Watch the Lecture Videos.
- Complete the Quizzes to solidify your understanding.
- Enhance your programming skills with Tutorials.
- Challenge yourself with the Exercises.

## 1. Introduction to Artificial Intelligence (AI) and Machine Learning (ML)

This lecture introduces you to the field of artificial intelligence and machine learning.

We start with some historical background, because it helps to understand how the field evolved and how knowledge has often advanced alongside technology. We then give an overview of Artificial Intelligence (AI) and Machine Learning (ML), introducing both the concepts and their practical applications. AI, defined by pioneer John McCarthy as the “science and engineering of making intelligent machines,” aims to create systems that can replicate human understanding, thinking, learning, and behavior. The applications of AI span numerous fields, including agriculture, where it is used for robotics, crop monitoring, and predictive analysis, and healthcare, where it assists in diagnostics. In the biomedical sphere, AI supports complex tasks like protein structure prediction, data categorization, and summarization.

The distinction between AI and ML is key to understanding the field. While AI broadly represents the ambition to emulate human-like abilities, ML is a specific subset of AI that focuses on training machines to learn from data. This learning process is structured around three main components: the task (T), which defines the problem to solve (e.g., classification or regression), experience (E), where an algorithm learns patterns from data, and performance (P), a metric (such as accuracy or F1 score) used to assess how well the model performs the task over time.
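As a rough illustration of the performance component (P), the following sketch computes accuracy and the F1 score by hand for a binary classification task. The labels and predictions are made-up values, and the function names `accuracy` and `f1_score` are hypothetical helpers written for this example, not part of any library used in this submodule.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = disease, 0 = healthy
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"accuracy = {acc:.2f}, F1 = {f1:.2f}")
```

Accuracy counts all correct predictions equally, while F1 looks only at the positive class, which is why the two can diverge on imbalanced datasets.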

ML methods are categorized primarily by the amount of human supervision involved: supervised learning relies on labeled data to predict outputs, unsupervised learning finds patterns without labels, and reinforcement learning teaches an agent to optimize actions based on feedback (rewards or penalties). Models generated by ML algorithms represent the learned relationships between inputs and outputs. These models can be applied either in batch (offline) mode, where a fixed dataset trains the model before deployment, or online mode, where continuous learning from new data keeps the model updated. Model-based learning and instance-based learning further divide methods based on whether predictions come directly from a model or are based on stored examples.
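To make the batch versus online distinction concrete, here is a deliberately tiny sketch in which the "model" is just an estimate of the mean of a data stream. The values are made up for illustration. The batch version sees the whole dataset at once, while the online version updates its estimate one observation at a time.

```python
# Made-up measurements arriving over time
data_stream = [4.0, 6.0, 5.0, 9.0]

# Batch (offline) learning: fit once on the full, fixed dataset
batch_estimate = sum(data_stream) / len(data_stream)

# Online learning: update the estimate as each new value arrives
online_estimate = 0.0
count = 0
for value in data_stream:
    count += 1
    # Move the current estimate a fraction of the way toward the new value
    online_estimate += (value - online_estimate) / count

print(batch_estimate, online_estimate)
```

For this simple case both approaches arrive at the same estimate. Real online learners generally only approximate the batch solution, but they can keep adapting as new data arrives.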

Training and evaluating ML models involves three phases: training (optimizing model parameters), validation (monitoring for overfitting), and testing (assessing model performance on new data). Cross-validation, where the data is split into several subsets that take turns serving as the validation set, gives a more reliable estimate of model performance than a single split. Understanding the types of data is also fundamental, as different data types (for example, categorical or continuous) influence the choice of algorithms and modeling approaches.
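The splitting step of k-fold cross-validation can be sketched in a few lines of plain Python. The helper `k_fold_indices` is a hypothetical function written for this example. It assumes the samples are already in random order; in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: train={train_idx} validation={val_idx}")
```

Each sample is used for validation exactly once and for training in the remaining folds. The model would be trained and scored once per fold, and the scores averaged.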

Overall, this lecture gives a structured framework to approach AI and ML, emphasizing how models are created, trained, and evaluated across diverse applications.

### Lecture Video

In [None]:
from IPython.display import YouTubeVideo

# Youtube
YouTubeVideo(id='intro_to_ai_ml', height=200, width=400)

### Lecture Slides

Download the lecture slides [Introduction to AI and ML](Submodule_1/Lectures/Submodule_1_Lecture_1_AI_MLintro.pptx).

### Quizzes

In [None]:
%pip install jupyterquiz
from jupyterquiz import display_quiz
display_quiz("Submodule_1/Quizzes/Intro_to_AL_ML_Quiz.json")

## 2. AI/ML Basic Concepts
In this section you will learn some foundational AI/ML concepts that you will encounter in the subsequent submodules.

Machine learning (ML) is a subset of artificial intelligence (AI) that focuses on teaching computers to learn from data and make predictions or decisions without explicit programming. At the core of ML are several fundamental concepts:   

**Classes and Labels**

In ML, we often deal with datasets containing instances that belong to specific categories or classes. For example, in image classification, we might have classes like "cat" and "dog." To represent these classes, we assign labels to each instance, which could be numerical codes or textual descriptions.
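As a small sketch of how textual class names can be turned into numerical labels, the example below builds a mapping from class name to integer code. The label list is made up, and sorting the class names is just one possible convention for assigning codes.

```python
# Made-up labels for five images
labels = ["cat", "dog", "dog", "cat", "dog"]

# The set of distinct classes, in a fixed (alphabetical) order
classes = sorted(set(labels))

# Map each class name to an integer code
class_to_index = {name: i for i, name in enumerate(classes)}

# Replace each textual label with its numerical code
encoded = [class_to_index[name] for name in labels]

print(classes)
print(encoded)
```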


**Loss Functions**

A loss function measures the discrepancy between a model's predictions and the actual ground truth values. It's a crucial component of training, as it guides the model to adjust its parameters in a way that minimizes the loss. Common loss functions include mean squared error for regression problems and cross-entropy loss for classification problems.   
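The two loss functions named above can be written out directly. This is a minimal sketch using made-up numbers. The cross-entropy shown is the binary form, and the clipping constant `eps` is an assumption added here to avoid taking the logarithm of zero.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood of the true labels (0 or 1)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


# Regression: predictions close to the targets give a small loss
mse = mean_squared_error([3.0, 5.0, 2.0], [2.5, 5.0, 3.0])

# Classification: confident correct vs. confident wrong probabilities
bce_good = binary_cross_entropy([1, 0], [0.9, 0.1])
bce_bad = binary_cross_entropy([1, 0], [0.1, 0.9])

print(mse, bce_good, bce_bad)
```

Cross-entropy penalizes confident wrong predictions much more heavily than confident correct ones, which is the signal the model uses to adjust its parameters during training.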

**Parameters and Hyperparameters**

Parameters are the variables within a model that are adjusted during the training process to minimize the loss function. They are learned from the data itself. Hyperparameters, on the other hand, are settings that are configured before training starts and influence the learning process. Examples of hyperparameters include learning rate, batch size, and the number of epochs.
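The following sketch shows the distinction in a toy setting: fitting a line with gradient descent. The slope `w` and intercept `b` are parameters learned from the data, while `learning_rate` and `n_epochs` are hyperparameters chosen before training. The data and the hyperparameter values are made up for illustration.

```python
# Made-up data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

# Hyperparameters: set before training, not learned from the data
learning_rate = 0.05
n_epochs = 2000

# Parameters: start at arbitrary values and are adjusted during training
w, b = 0.0, 0.0
n = len(xs)

for _ in range(n_epochs):
    # Gradients of the mean squared error with respect to w and b
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    # Step against the gradient to reduce the loss
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned w = {w:.4f}, b = {b:.4f}")
```

Changing a hyperparameter changes how training proceeds. For example, a much larger learning rate can make the updates diverge, while too few epochs can stop training before the parameters settle.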

**Inductive Bias**

Inductive bias refers to the assumptions or prior knowledge that a model brings to a learning task. It shapes the model's learning process and can influence the types of patterns it is able to recognize. For instance, a neural network designed for image recognition might have an inductive bias towards recognizing spatial patterns.

**Bias and Variance**

Bias and variance are two key concepts in ML that relate to the accuracy and consistency of a model. Bias refers to the systematic error of a model, while variance measures how much a model's predictions change with different training sets. The goal is to find a balance between bias and variance, avoiding both underfitting (high bias) and overfitting (high variance).
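A simplified way to see underfitting and overfitting is to compare two extreme models on the same data. In the sketch below, a model that always predicts the training mean is too rigid to follow the trend, while a 1-nearest-neighbor model memorizes the training set. The data points are made up, and this is only an illustration of the training/test gap, not a formal bias-variance decomposition.

```python
def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up data: roughly y = x, with noise in the training targets
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [0.1, 0.9, 2.3, 2.8, 4.2, 4.9]
test_x = [0.5, 1.5, 2.5, 3.5, 4.5]
test_y = [0.5, 1.5, 2.5, 3.5, 4.5]

mean_y = sum(train_y) / len(train_y)


def mean_model(x):
    """Very rigid model: ignores x and always predicts the training mean."""
    return mean_y


def nearest_neighbor_model(x):
    """Very flexible model: returns the target of the closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


results = {}
for name, model in [("mean", mean_model), ("1-NN", nearest_neighbor_model)]:
    train_err = mse(train_y, [model(x) for x in train_x])
    test_err = mse(test_y, [model(x) for x in test_x])
    results[name] = (train_err, test_err)
    print(f"{name}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")
```

The mean model has a large error even on the data it was trained on, which is the signature of underfitting. The nearest-neighbor model reproduces the training data perfectly but does worse on unseen points, and that gap between training and test error is the usual warning sign of overfitting.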

**Generalization**

Generalization is the ability of a model to perform well on unseen data. A good model should be able to generalize its knowledge to new examples, rather than simply memorizing the training data. Overfitting and underfitting can hinder generalization.

By understanding these core concepts, one can gain a solid foundation in machine learning and appreciate the power and complexity of AI systems.


### Lecture Video

In [None]:
from IPython.display import YouTubeVideo

# Youtube
YouTubeVideo(id='basic_concepts_in_ai_ml', height=200, width=400)

### Lecture Slides

Download the lecture slides [Basic Concepts in AI and ML](Submodule_1/Lectures/Submodule_1_Lecture_2_AL_ML_basic_concepts.pptx).

### Quizzes

In [None]:
%pip install jupyterquiz
from jupyterquiz import display_quiz
display_quiz("Submodule_1/Quizzes/AL_ML_Basic_Concepts_Quiz.json")

## 3. AI/ML Applications
In this section you will go over some general applications of AI/ML in the biomedical domain. We will explore the intersection of Artificial Intelligence (AI) and Machine Learning (ML) with various biological data types. The section highlights potential applications and challenges in analyzing:

**Nucleic Acid Sequence**

Machine learning models can be used to predict DNA accessibility, 3D genome organization, and interactions between regulatory elements like enhancers and promoters. Common models for these tasks include 1D CNNs, RNNs, and Transformers. However, repetitive regions and very long sequences in genomes can pose challenges.
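Before a sequence can be fed to models such as 1D CNNs, it has to be converted to numbers. One common representation is one-hot encoding, sketched below. The function name `one_hot_encode` and the choice to encode unknown bases such as `N` as all zeros are assumptions made for this example.

```python
NUCLEOTIDES = "ACGT"


def one_hot_encode(sequence):
    """Return one row per base, with a 1 in the column of the matching nucleotide."""
    encoded = []
    for base in sequence.upper():
        # Bases outside A/C/G/T (e.g. N) become an all-zero row
        row = [1 if base == nucleotide else 0 for nucleotide in NUCLEOTIDES]
        encoded.append(row)
    return encoded


matrix = one_hot_encode("ACGTN")
for base, row in zip("ACGTN", matrix):
    print(base, row)
```

The result has one row per position and four columns, so a sequence of length L becomes an L x 4 matrix. This also shows why very long sequences are a practical challenge: the input grows linearly with sequence length.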

**Protein Sequences**

Predicting protein structure, function, and interactions with other proteins are crucial areas where ML can be applied. 2D CNNs with residual networks, transformers, and multilayer perceptrons are some recommended models. A significant challenge here is data leakage from homology: when closely related (homologous) proteins appear in both the training and test sets, a model can score well by exploiting sequence similarity rather than learning patterns that generalize to unrelated proteins.
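One common way to reduce homology leakage is to split the data by groups of related proteins rather than by individual proteins, so that no family contributes to both training and test sets. The sketch below assumes that each protein has already been assigned a cluster ID (for example, by sequence-similarity clustering). The cluster names and the helper `split_by_cluster` are hypothetical.

```python
def split_by_cluster(cluster_ids, test_clusters):
    """Assign whole clusters to either the training set or the test set."""
    train_idx, test_idx = [], []
    for i, cluster in enumerate(cluster_ids):
        if cluster in test_clusters:
            test_idx.append(i)
        else:
            train_idx.append(i)
    return train_idx, test_idx


# Hypothetical cluster assignment for seven proteins
cluster_ids = ["famA", "famA", "famB", "famC", "famB", "famC", "famD"]

train_idx, test_idx = split_by_cluster(cluster_ids, test_clusters={"famC"})
print("train:", train_idx)
print("test:", test_idx)

# No cluster appears on both sides of the split
train_clusters = {cluster_ids[i] for i in train_idx}
held_out_clusters = {cluster_ids[i] for i in test_idx}
print("shared clusters:", train_clusters & held_out_clusters)
```

A random split over individual proteins could place two members of the same family on opposite sides, which is exactly the situation this grouping avoids.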

**Protein 3D Structures**

Machine learning can refine protein models, assess their quality, and predict how mutations might affect stability. While models like GCNs and 3D CNNs show promise, a lack of data, particularly for protein complexes and disordered proteins, hinders progress.

**Gene Expression Data**

Understanding how genes are expressed and interact is essential for biological research. Clustering, CNNs, and autoencoders are some models used to analyze gene expression data. However, the unclear link between co-expression and gene function, high dimensionality of data, and inherent noise present significant challenges.

**Mass Spectrometry Data**

Identifying peaks in mass spectrometry data and annotating metabolites are potential applications of ML. CNNs using spectral data and traditional methods using derived features are both employed. Standardization across datasets and a lack of established benchmarks are hurdles in this area.

**Images**

Machine learning excels at image analysis tasks like medical image recognition and reconstructing cryo-EM images. While 2D CNNs, autoencoders, and traditional methods are commonly used, systematic differences in data collection techniques and challenges in obtaining large, consistent datasets can affect model performance.

**Molecular Structures**

Predicting properties like antibiotic activity, drug toxicity, and protein-ligand docking are valuable applications of ML in drug discovery. CNNs using molecular graphs, traditional methods, RNNs using text-based representations, and autoencoders are all employed. However, the limited availability of experimental data for most small molecules remains a major obstacle.

**Protein-Protein Interaction Networks**

These networks reveal protein interactions, which are crucial for understanding cellular processes. CNNs and graph embedding techniques are used to analyze these networks. However, incomplete interaction data and the influence of cellular location on protein interactions pose challenges for accurate predictions.

By leveraging AI and ML, researchers are opening new avenues for understanding biological systems and developing novel drugs and therapies. Alongside these opportunities, the applications above come with challenges that must be addressed to build robust and generalizable models.

### Lecture Video

In [None]:
from IPython.display import YouTubeVideo

# Youtube
YouTubeVideo(id='applications_in_ai_ml', height=200, width=400)

### Lecture Slides

Download the lecture slides [Applications in AI and ML](Submodule_1/Lectures/Submodule_1_Lecture_2_AL_ML_applocations.pptx).

### Quizzes

In [None]:
%pip install jupyterquiz
from jupyterquiz import display_quiz
display_quiz("Submodule_1/Quizzes/AL_ML_Applications_Quiz.json")

## 4. Tutorials
+ [Getting started with Jupyter Notebooks on Amazon SageMaker](Submodule_1/Tutorials/Submodule_1_Tutorial_1_Introduction_to_Jupyter_on_AWS.ipynb)
+ [Introduction to NumPy](Submodule_1/Tutorials/Submodule_1_Tutorial_2_Introduction_to_NumPy.ipynb)
+ [Introduction to Pandas](Submodule_1/Tutorials/Submodule_1_Tutorial_3_Introduction_to_Pandas.ipynb)
+ [Introduction to Biology with Python](Submodule_1/Tutorials/Submodule_1_Tutorial_4_Introduction_to_Biology_with_Python.ipynb)

## 5. Exercises
+ [NumPy Exercise](Submodule_1/Exercises/Submodule_1_Exercise_NumPy.ipynb) ([Solution](Submodule_1/Exercises/Submodule_1_Exercise_NumPy_Solution.ipynb))
+ [Pandas Exercise](Submodule_1/Exercises/Submodule_1_Exercise_Pandas.ipynb) ([Solution](Submodule_1/Exercises/Submodule_1_Exercise_Pandas_Solution.ipynb))

## Conclusions
Artificial intelligence (AI) and machine learning (ML) are rapidly transforming various industries, including biomedicine. AI/ML, at its core, involves the development of algorithms that enable computers to learn from data and make intelligent decisions. These techniques have found significant applications in biomedicine, such as drug discovery, medical image analysis, personalized medicine, and disease diagnosis. By analyzing vast amounts of biomedical data, AI/ML algorithms can identify patterns, make accurate predictions, and accelerate the development of new treatments and therapies. The integration of AI/ML in biomedicine holds the potential to revolutionize healthcare and improve patient outcomes.

## Clean up
Remember to shut down your VM and delete any relevant resources.