## Synthetic data generation
**Tasks**: Creating artificial data samples that mimic real-world datasets\
**Tools**: TensorFlow Probability, PyTorch, SDV (Synthetic Data Vault), GANs (Generative Adversarial Networks)\
**Purpose and Applications**: TensorFlow Probability and PyTorch enable probabilistic modeling, while SDV provides a framework for generating synthetic data based on statistical models. GANs, a well-known generative modeling technique, excel at creating realistic data samples. These tools are used in healthcare for creating synthetic patient records, in finance for creating simulated financial transactions, and for training machine learning models when the original data is scarce or sensitive.
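
As a rough illustration of generating synthetic data from a statistical model, the sketch below fits a multivariate Gaussian to a small numeric table and samples new rows from it. It uses plain NumPy as a stand-in and is not how SDV or a GAN works internally; the column meanings (`age`, `blood_pressure`) and the helper names `fit_gaussian` and `sample_synthetic` are hypothetical.

```python
import numpy as np

def fit_gaussian(real):
    """Estimate the mean vector and covariance matrix of the real data."""
    return real.mean(axis=0), np.cov(real, rowvar=False)

def sample_synthetic(mean, cov, n_samples, seed=0):
    """Draw synthetic rows from the fitted multivariate Gaussian."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_samples)

# Hypothetical "real" data: two correlated numeric columns
rng = np.random.default_rng(42)
age = rng.normal(50, 10, size=500)
blood_pressure = 80 + 0.8 * age + rng.normal(0, 5, size=500)
real = np.column_stack([age, blood_pressure])

mean, cov = fit_gaussian(real)
synthetic = sample_synthetic(mean, cov, n_samples=1000)

# The synthetic rows should roughly preserve the correlation in the real data
print(np.corrcoef(real, rowvar=False)[0, 1])
print(np.corrcoef(synthetic, rowvar=False)[0, 1])
```

A Gaussian only captures means and linear correlations, so this is a baseline; the dedicated tools listed above are meant for data with more complex structure.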

---
## Image generation and manipulation
**Tasks**: Generating synthetic images, modifying visual attributes, and creating new designs\
**Tools**: StyleGAN, DALL-E, BigGAN\
**Purpose and Applications**: StyleGAN, a specialized implementation and extension of GANs, can create high-quality images, DALL-E can create images from text descriptions, and BigGAN can generate diverse and realistic images. These tools have applications in the art and design, content creation, fashion, and gaming industries. They enable the creation of unique images and enhance creative workflows.

---
## Natural language generation
**Tasks**: Generating human-like text, creating stories, articles, or dialogue\
**Tools**: OpenAI's GPT (Generative Pre-Trained Transformer) models, Hugging Face's Transformers\
**Purpose and Applications**: GPT models can generate coherent, context-appropriate text, and the Transformers library provides flexibility in text generation and in fine-tuning models for specific tasks. Data professionals use these tools for chatbots, content generation, automated summarization, and conversational AI in media, customer service, and content creation.

---
## Music and audio synthesis
**Tasks**: Generating musical compositions or synthesizing audio samples\
**Tools**: Magenta, Jukebox, NSynth\
**Purpose and Applications**: Magenta can generate melodies, harmonies, and musical compositions using deep learning techniques. Jukebox can create new songs in various genres, and NSynth can generate new sounds by combining existing ones. Artists use these tools in music production, gaming, and entertainment for creating original compositions, sound effects, and adaptive soundtracks.

---
## Simulation and data augmentation
**Tasks**: Simulating scenarios and augmenting datasets for machine learning models\
**Tools**: Unity ML-Agents, NVIDIA's SimNet, Augmentor\
**Purpose and Applications**: Unity ML-Agents can create intelligent agents for simulations, SimNet can simulate realistic data, and Augmentor provides data augmentation techniques. Data professionals use these tools in robotics, gaming, autonomous vehicles, and simulations for training AI models and testing algorithms in different environments.

---
## Content generation
**Tasks**: Creating human-like content, such as text, images, and music\
**Tools**: OpenAI's GPT models, DeepDream, StyleGAN\
**Purpose and Applications**: GPT models excel at generating coherent text, DeepDream creates surreal images, and StyleGAN generates realistic images. Content creators use these tools for storytelling, art, and entertainment.

---
## Anomaly detection
**Tasks**: Identifying outliers or anomalies in datasets\
**Tools**: Autoencoders, Isolation Forest, GANs\
**Purpose and Applications**: Autoencoders can detect anomalies or outliers in the data, Isolation Forest can effectively handle anomaly detection in high-dimensional data, and GANs can model the distribution of normal data so that deviations stand out. Data professionals use these tools to detect financial fraud, manufacturing defects, and cybersecurity threats.
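
A minimal sketch of anomaly detection with Isolation Forest, assuming scikit-learn is available. The data is synthetic and hypothetical: 200 points drawn around the origin plus three hand-placed outliers, with a `contamination` value chosen only for this toy set.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: a cluster of normal points plus three obvious outliers
rng = np.random.default_rng(0)
normal_points = rng.normal(0, 1, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.5], [10.0, -8.0]])
X = np.vstack([normal_points, outliers])

# contamination is the assumed fraction of anomalies in the data
model = IsolationForest(contamination=0.02, random_state=0)
labels = model.fit_predict(X)  # 1 = inlier, -1 = anomaly

anomaly_idx = np.where(labels == -1)[0]
print(anomaly_idx)
```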

---
## Data augmentation
**Tasks**: Enhancing training datasets by generating variations of existing data\
**Tools**: CycleGAN, Augmentor, Neural Style Transfer\
**Purpose and Applications**: CycleGAN can perform image-to-image translation, Augmentor can generate augmented images, and Neural Style Transfer allows the artistic transformation of images based on the style of one image and the content of another. Data professionals use these tools in computer vision, medical imaging, and data augmentation for machine learning models.
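
To make "generating variations of existing data" concrete, here is a small sketch that uses only NumPy array operations (random flips, 90-degree rotations, and light noise). It is a simplified stand-in, not the Augmentor API; the `augment` helper and the random 32x32 image are hypothetical.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly flipped, rotated, and noise-perturbed copy of an image."""
    out = image
    if rng.random() < 0.5:
        out = np.fliplr(out)                      # horizontal flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))  # rotate by 0, 90, 180, or 270 degrees
    noisy = out + rng.normal(0.0, 0.02, size=out.shape)
    return np.clip(noisy, 0.0, 1.0)               # keep pixel values in [0, 1]

rng = np.random.default_rng(0)
image = rng.random((32, 32))  # hypothetical grayscale image with values in [0, 1]
augmented = [augment(image, rng) for _ in range(5)]
print(len(augmented), augmented[0].shape)
```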

---
## Human-computer interaction
**Tasks**: Enabling human-like interactions through chatbots, assistants, and avatars\
**Tools**: Dialogflow, Rasa, RunwayML\
**Purpose and Applications**: Dialogflow and Rasa are suited to building conversational AI, whereas RunwayML suits creative coding. These tools are used in the customer service, virtual assistant, and gaming industries to enhance the user experience.

--- 

### Guide to Choosing a Generative AI Model Type

| Model                       | Key Features                                                                                                                                                                                                                                          | Applications                                                                                                                                                                                                         |
|-----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Generative adversarial networks (GANs) | <br> - Two competing neural networks: generator and discriminator.<br> - The generator learns to create realistic data, while the discriminator learns to distinguish real from fake.<br> - The adversarial training process continuously improves both networks.<br> - Can be challenging to train and achieve stable results. | <br> - **Image generation**: faces, landscapes, objects.<br> - **Text generation**: poems, code, scripts.<br> - **Video generation**: realistic videos, animation.<br> - **Drug discovery**: generate molecules with intended properties.<br> - **Music generation**: composing new songs |
| Variational autoencoders (VAEs) | <br> - Encode input data into a lower-dimensional latent space.<br> - Learn a probability distribution over the latent space.<br> - Decode samples from the latent space to generate new data points.<br> - Focus on learning a meaningful representation of the data | <br> - **Image compression**: efficiently store and transmit images.<br> - **Anomaly detection**: identify unusual data points.<br> - **Dimensionality reduction**: compress high-dimensional data.<br> - **Text summarization**: generate concise summaries of text documents |
| Autoregressive models       | <br> - Generate data point by point, conditioned on previously generated points.<br> - Use recurrent neural networks (RNNs) or transformers to capture long-term dependencies.<br> - Can be computationally expensive for long sequences                                          | <br> - **Text generation**: realistic and coherent text sequences.<br> - **Music generation**: generating music that follows genre and style.<br> - **Time series forecasting**: predicting future values of a time series.<br> - **Image inpainting**: filling in missing parts of an image             |
| Diffusion models            | <br> - Start with random noise and gradually denoise it into realistic data.<br> - Often use a U-Net architecture with skip connections to preserve information.<br> - Can be more stable and easier to train than GANs, but often slower                                       | <br> - **Image generation**: high-quality and diverse images.<br> - **Text generation**: coherent and grammatically correct text.<br> - **Audio generation**: realistic and musical audio.<br> - **Inpainting and denoising**: improving the quality of images or audio                     |
| Flow-based models          | <br> - Transform a simple distribution (Gaussian) into a complex one using invertible transformations.<br> - Learn the parameters of these transformations from the data.<br> - Can be efficient and accurate for high-dimensional data, but training can be challenging            | <br> - **Image generation**: realistic and diverse images.<br> - **Density estimation**: modeling the probability distribution of data.<br> - **Dimensionality reduction**: compress high-dimensional data.<br> - **Anomaly detection**: identify unusual data points                   |
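
The autoregressive row in the table above describes generating data point by point, each conditioned on what came before. A minimal sketch of that idea follows, assuming a character-level bigram (first-order Markov) model in place of the RNNs or transformers the table names. The corpus string and the helpers `fit_bigram` and `generate` are hypothetical.

```python
import random
from collections import defaultdict

def fit_bigram(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, length, seed=0):
    """Sample one character at a time, conditioned on the previous character."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = counts.get(out[-1])
        if not followers:
            break  # no observed continuation for this character
        chars = list(followers)
        weights = [followers[c] for c in chars]
        out.append(rng.choices(chars, weights=weights, k=1)[0])
    return "".join(out)

corpus = "the theory of the thing is that the other theme thrives"
counts = fit_bigram(corpus)
sample = generate(counts, start="t", length=40)
print(sample)
```

Real autoregressive models condition on much longer histories, which is where the computational cost for long sequences noted in the table comes from.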



---
### Comparison of Models on Different Considerations

| Feature                  | GANs                         | VAEs                                | Autoregressive models              | Diffusion models              | Flow-based models               |
|--------------------------|------------------------------|-------------------------------------|------------------------------------|-------------------------------|--------------------------------|
| Data type                | Images, text, audio          | Images, text, continuous data       | Images, text, sequences            | Images, text                 | Images, continuous data        |
| Task objective           | High-fidelity generation, data augmentation | Encoding/decoding, representation learning | Sequence generation, text-to-image translation | Image generation, editing, inpainting | Image generation, conditional generation |
| Quality of samples       | High-fidelity, diverse       | Often blurry, less realistic        | Sharp, high-resolution             | High-fidelity, diverse      | High-fidelity, controllable    |
| Control over generation | Limited                      | Moderate                            | High                               | Moderate                     | High                           |
| Training complexity      | High                         | Moderate                            | High                               | Moderate                     | High                           |
| Interpretability         | Low                          | Moderate                            | High                               | Moderate                     | Low                            |

---

# Module 1 Cheatsheet: Data Science and Generative AI

## Popular GenAI tools

| Name of model | Usage | Link |
| --- | --- | --- |
| DataRobot | A simple tool useful for data analysis and model building operations | [DataRobot](https://www.datarobot.com/) |
| Mostly.AI | Synthetic data generation | [Mostly.AI](https://mostly.ai/) |
| ChatGPT | GPT-based model used for text and code generation based on natural language queries | [ChatGPT](https://openai.com/chatgpt) |
| DB Sensei | Generate SQL queries for databases using natural language queries | [DB Sensei](https://dbsensei.com/) |

---
## Important prompts for data preparation

| Task | Prompt |
| --- | --- |
| Read a CSV data file and load it to a data frame. | Write a Python code that can perform the following tasks: <br>Read the CSV file, located on a given file path, into a Pandas data frame, assuming that the first row of the file contains the headers for the data. |
| Data cleaning: Identify and replace missing values per the following guidelines. | Write a Python code to perform the following tasks:<br>1. Identify the attributes with missing values.<br>2. Segregate these attributes into categorical and continuous valued attributes.<br>3. Drop the entire row if the value is missing in the target variable.<br>4. If the value is missing in a categorical attribute, replace the missing values with the most frequent value in the column.<br>5. If the value is missing in a continuous value attribute, replace the missing values with the mean value of the entries in the column. |
| Data Normalization: Normalize an attribute to its maximum value. | Write a Python code to normalize the content under a given attribute in a data frame df to its maximum value. Make changes to the original data, and do not create a new attribute. |
| Convert a categorical variable into indicator variables. | Write a Python code to perform the following tasks.<br>1. Convert a data frame df attribute into indicator variables, saved as df1, with the naming convention `Name_<unique value of the attribute>`.<br>2. Append df1 into the original data frame df.<br>3. Drop the original attribute from the data frame df. |
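
For reference, the kind of code these data preparation prompts might produce can be sketched as follows. This is one plausible result, not the output of any particular tool. The inline CSV text stands in for a file path, and the column names `price` (treated as the target), `area`, and `fuel` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """price,area,fuel
10000,50,gas
,60,diesel
15000,,gas
20000,80,
12000,70,diesel
18000,90,gas
"""

# header=0 treats the first row as the column names
df = pd.read_csv(io.StringIO(csv_text), header=0)
target = "price"

# 1. Identify the attributes with missing values
missing_cols = df.columns[df.isna().any()].tolist()

# 2. Segregate them into continuous and categorical attributes
continuous = [c for c in missing_cols
              if c != target and pd.api.types.is_numeric_dtype(df[c])]
categorical = [c for c in missing_cols
               if c != target and c not in continuous]

# 3. Drop rows where the target is missing
df = df.dropna(subset=[target]).reset_index(drop=True)

# 4. Fill categorical gaps with the most frequent value
for c in categorical:
    df[c] = df[c].fillna(df[c].mode()[0])

# 5. Fill continuous gaps with the column mean
for c in continuous:
    df[c] = df[c].fillna(df[c].mean())

# Normalize an attribute to its maximum value, in place
df["area"] = df["area"] / df["area"].max()

# Convert a categorical attribute into indicator variables, append, drop original
df1 = pd.get_dummies(df["fuel"], prefix="fuel")
df = pd.concat([df, df1], axis=1).drop(columns="fuel")

print(df)
```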

---


# Module 2 Cheatsheet: Use of Generative AI for Data Science

## Popular GenAI tools

| Name of model | Usage                                                    | Link                                    |
|---------------|----------------------------------------------------------|-----------------------------------------|
| Hal9          | EDA tool to identify key insights on data                | [https://www.hal9.com/](https://www.hal9.com/)       |
| Columns.ai    | Data visualization tool to create useful charts           | [https://columns.ai/](https://columns.ai/)           |
| Akkio         | Data visualization tool to create data plots like regression plots, box plots, correlation heatmaps, and so on | [https://www.akkio.com/](https://www.akkio.com/) |

---

## Important prompts for generating data insights and visualizations

| Task                                                                      | Prompt                                                                                                                                                                                                                                                                                                                                                           |
|---------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Generate a statistical description of data.                               | Write a Python code to generate the statistical description of all the features used in the data set. Include "object" data types as well.                                                                                                                                                                                                                      |
| Create regression plots between a target variable and a continuous valued source variable. | Write a Python code to generate a regression plot between a target variable and a source variable of a data frame.                                                                                                                                                                                                                                             |
| Create box plots between a target and categorical source variable.        | Write a Python code to generate a box plot between a target variable and a source variable of a data frame.                                                                                                                                                                                                                                                    |
| Evaluate parametric interdependence using correlation, p-value, and Pearson coefficient | Write a Python code to evaluate correlation, Pearson coefficient, and p-values for all attributes of a data frame against the target attribute. |
| Group variables to create pivot tables. Create a pcolor plot for the pivot table. | Write a Python code that performs the following actions: <br> 1. Groups three attributes as available in a data frame df. <br> 2. Creates a pivot table for this group, using a target attribute and aggregation function as mean. <br> 3. Plots a pcolor plot for this pivot table. |
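
A hedged sketch of what the non-plotting parts of these prompts could yield, assuming pandas and SciPy. The data frame is randomly generated and its columns (`engine_size`, `body`, `drive`, `price`) are hypothetical. Plotting calls such as seaborn's `regplot` and `boxplot` or matplotlib's `pcolor` are left out to keep the example free of display dependencies.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data set with one continuous and two categorical source variables
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "engine_size": rng.uniform(1.0, 4.0, size=n),
    "body": rng.choice(["sedan", "hatchback"], size=n),
    "drive": rng.choice(["fwd", "rwd"], size=n),
})
df["price"] = 5000 + 4000 * df["engine_size"] + rng.normal(0, 1000, size=n)
target = "price"

# Statistical description of all features, including non-numeric columns
description = df.describe(include="all")

# Pearson coefficient and p-value of each numeric attribute against the target
results = {}
for col in df.select_dtypes(include="number").columns:
    if col != target:
        r, p = stats.pearsonr(df[col], df[target])
        results[col] = (r, p)

# Group by two attributes, average the target, and pivot into a 2D table
grouped = df.groupby(["drive", "body"], as_index=False)[target].mean()
pivot = grouped.pivot(index="drive", columns="body", values=target)

print(description)
print(results)
print(pivot)
```

The resulting `pivot` table is the 2D grid that a pcolor plot would then color by mean target value.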

---

## Important prompts for model development and refinement

| Task                                                                      | Prompt                                                                                                                                                                                                                                                                                                                                                           |
|---------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Fit and evaluate a linear regression between a single source attribute and a target attribute | Write a Python code that performs the following tasks: <br> 1. Develops and trains a linear regression model that uses one attribute of a data frame as the source variable and another as a target variable. <br> 2. Calculates and displays the MSE and R^2 values for the trained model. |
| Fit and evaluate a linear regression between multiple source attributes and a target attribute | Write a Python code that performs the following tasks: <br> 1. Develops and trains a linear regression model that uses some attributes of a data frame as the source variables and one of the attributes as a target variable. <br> 2. Calculates and displays the MSE and R^2 values for the trained model. |
| Polynomial regression model with single source and target variable        | Write a Python code that performs the following tasks: <br> 1. Develops and trains multiple polynomial regression models, with orders 2, 3, and 5, that use one attribute of a data frame as the source variable and another as a target variable. <br> 2. Calculates and displays the MSE and R^2 values for the trained models. <br> 3. Compares the performance of the models. |
| Pipeline creation for scaling, polynomial feature creation, and linear regression | Write a Python code that performs the following tasks: <br> 1. Create a pipeline that performs parameter scaling, polynomial feature generation, and linear regression. Use the set of multiple features as before to create this pipeline. <br> 2. Calculate and display the MSE and R^2 values for the trained model.                          |
| Grid search with ridge regression and cross validation                   | Write a Python code that performs the following tasks: <br> 1. Use polynomial features for some of the attributes of a data frame. <br> 2. Perform a grid search on a ridge regression model for a set of values of hyperparameter alpha and polynomial features as input. <br> 3. Use cross-validation in the grid search. <br> 4. Evaluate the resulting model's MSE and R^2 values.               |
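
The prompts above map onto a fairly standard scikit-learn workflow. The sketch below is one possible shape for the generated code, assuming scikit-learn is installed. The data is synthetic, and the alpha grid, polynomial degree, and number of folds are arbitrary illustrative choices. Metrics are computed on the training data for brevity; a held-out test split would give a fairer estimate.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical data: the target depends quadratically on the first feature
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = 1.5 * X[:, 0] ** 2 - 2.0 * X[:, 1] + rng.normal(0, 0.3, size=200)

# Linear regression on multiple source attributes
linear = LinearRegression().fit(X, y)
linear_pred = linear.predict(X)
linear_mse = mean_squared_error(y, linear_pred)
linear_r2 = r2_score(y, linear_pred)

# Pipeline: scaling, polynomial feature creation, linear regression
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression()),
])
pipe.fit(X, y)
pipe_pred = pipe.predict(X)
pipe_mse = mean_squared_error(y, pipe_pred)
pipe_r2 = r2_score(y, pipe_pred)

# Grid search over the ridge hyperparameter alpha with 4-fold cross-validation
ridge_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", Ridge()),
])
alphas = [0.01, 0.1, 1, 10]
grid = GridSearchCV(ridge_pipe, {"model__alpha": alphas}, cv=4,
                    scoring="neg_mean_squared_error")
grid.fit(X, y)
grid_pred = grid.predict(X)
grid_mse = mean_squared_error(y, grid_pred)
grid_r2 = r2_score(y, grid_pred)

print(f"linear:   MSE={linear_mse:.3f}  R^2={linear_r2:.3f}")
print(f"pipeline: MSE={pipe_mse:.3f}  R^2={pipe_r2:.3f}")
print(f"ridge:    MSE={grid_mse:.3f}  R^2={grid_r2:.3f}  best={grid.best_params_}")
```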
