# Introduction to GRA 4160

## Lecture 1

### GRA 4160
### Predictive modelling with machine learning

#### Lecturer: Vegard H. Larsen

# Course overview

The vast amount of widely available data allows machines to solve challenging tasks without explicitly being programmed to do so, often outperforming existing methods based on domain knowledge and/or human experts.

In this course we focus on basic machine learning methods, both for supervised and unsupervised learning. We will look at how exactly machines "learn" from the data, and how they use the knowledge learned during training to solve tasks of interest.

The course covers both the theory of machine learning (treating the methods more rigorously than an introductory machine learning course would) and its practice (using Python).

# Course structure

Traditional lectures will only be used to a small extent in GRA 4160. Instead, we will rely more on assignments and readings, as well as interactive learning activities that allow you to apply and practice the concepts and techniques covered in the course. This will hopefully engage you and encourage you to think critically and creatively about machine learning.

In class, we will often work on and discuss hands-on lab sessions or exercises. These sessions will give you the opportunity to work with real or simulated data, build and evaluate machine learning models, and explore machine learning applications.

You will also have to do a mini project. This is a self-contained project that allows you to apply the concepts and techniques covered in the course to a specific problem or scenario. The project should be completed in a small group of 2-4 students. It is an opportunity for you to explore topics that are of particular interest and to apply what you have learned to new and challenging problems.

# Course content

Machine learning is a rapidly growing field that has the potential to revolutionize many industries and solve a wide range of problems. It is a valuable skill to have in today's job market, as it opens up career opportunities in many fields, increasingly including business. Machine learning involves solving complex problems and requires a strong foundation in mathematics, statistics, and computer science. It can also be highly creative, as you need to think critically and develop innovative solutions to real-world problems.

Here is an outline of the main material that will be covered in GRA 4160:

1. **Data preprocessing**: Techniques for preparing data for machine learning, including handling missing values, scaling and normalizing data, and selecting features.
2. **Machine learning basics**: The basics of machine learning, including definitions and terminology, the different types of machine learning (such as supervised, unsupervised, and reinforcement learning), and the types of problems that machine learning can solve.
3. **Supervised learning**: Techniques for training models on labeled data, including linear regression, logistic regression and ensemble methods such as random forest.
4. **Unsupervised learning**: Techniques for finding patterns in data without labeled examples, including clustering and dimensionality reduction.
5. **Model selection, evaluation and assessment**: The process of choosing and fine-tuning a machine learning model for a particular task.
6. **Ensemble methods**: Involve training multiple models and then combining their predictions to make a final prediction. These methods can be particularly effective when the individual models are diverse and make different types of errors.
7. **Neural networks**: The basics of neural networks, including their structure and function, and techniques for training them.
8. **Applications**: Examples of how machine learning is being used in various fields, such as image and speech recognition, natural language processing, and automation.

Software we will use:
1. **Scikit-Learn**: A free, open-source machine learning library for Python that provides a range of algorithms and tools for tasks such as classification, regression, clustering, and dimensionality reduction.
2. **Keras**: An open-source software library that provides a Python interface for building and training neural network models, running on top of other machine learning frameworks such as TensorFlow or PyTorch.

## 1. Data preprocessing

1. **Data cleaning**: Data cleaning involves identifying and correcting or removing errors or inconsistencies in the data. This can include handling missing values, removing duplicate records, and correcting errors in the data.
2. **Feature selection**: Involves selecting the most relevant and meaningful features to include in the model. This can help improve the performance of the model and reduce the risk of overfitting.
3. **Feature engineering**: Involves creating new features from existing data, or transforming existing features to make them more suitable for the model. This can include techniques such as normalization, standardization, and scaling.
4. **Data splitting**: Involves dividing the data into separate sets for training, validation, and testing. This is important to ensure that the model is evaluated on unseen data, and to prevent overfitting.
5. **Data transformation**: Involves applying transformations to the data in order to make it more suitable for the model. This can include techniques such as encoding categorical variables, scaling numerical variables, and applying dimensionality reduction techniques.
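
The preprocessing steps listed above can be sketched with scikit-learn and pandas. This is a minimal illustration, not a recipe: the toy table and its column names (`age`, `income`, `segment`, `churned`) are made up for the example, and median imputation is just one of several reasonable choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns (one with missing values),
# one categorical column, and a binary target.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 38, 29, np.nan, 50],
    "income": [300, 420, 390, 610, 580, 350, 400, 720],
    "segment": ["a", "b", "a", "c", "b", "a", "c", "b"],
    "churned": [0, 0, 1, 0, 1, 0, 1, 0],
})

# Data cleaning: remove exact duplicate records (none in this toy table).
df = df.drop_duplicates()

X = df.drop(columns="churned")
y = df["churned"]

# Data splitting: split before fitting any preprocessing step, so that
# imputation and scaling statistics come from the training data only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Numeric columns: fill missing values with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Data transformation: scale numeric columns, one-hot encode the categorical one.
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["segment"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

X_train_prepared = preprocess.fit_transform(X_train)  # fit on training data
X_test_prepared = preprocess.transform(X_test)        # reuse the fitted statistics

print(X_train_prepared.shape, X_test_prepared.shape)
```

Calling `fit_transform` on the training set and only `transform` on the test set keeps information from the test data out of the preprocessing step.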

## 2. Machine learning basics

Machine learning is a branch of artificial intelligence that involves training computer systems to perform tasks without explicit programming. It is based on the idea that systems can learn from data, rather than being explicitly programmed to perform a specific task. Machine learning algorithms use statistical techniques to find patterns in data and make predictions or decisions based on those patterns.

There are several types of machine learning, including supervised learning, unsupervised learning, and reinforcement learning.

In supervised learning, the goal is to learn a function that can map input data to output labels, using labeled training data.

Unsupervised learning involves finding patterns in data without labeled examples, and is used for tasks such as clustering and dimensionality reduction.

Reinforcement learning involves training an agent to take actions in an environment in order to maximize a reward signal, and is used for tasks such as controlling robots or playing games. We will not cover reinforcement learning in this course.

## 3. Supervised learning

Supervised learning is a type of machine learning where the algorithm is trained on labeled data. This means that the input data is accompanied by a label indicating the correct output. The goal of the algorithm is to learn a function that maps the input data to the corresponding output labels, based on the examples provided in the training data.

During the training process, the algorithm is presented with a large number of input-output pairs, and it uses these examples to learn the relationship between the inputs and the corresponding outputs. The algorithm then makes predictions on new, unseen data by applying the learned function. Supervised learning is often used for tasks such as image classification, speech recognition, and natural language processing, where the goal is to predict a specific output class or label given an input.
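
As a minimal sketch of this train-then-predict workflow, the example below fits a logistic regression on synthetic labeled data and applies it to held-out examples. The dataset comes from scikit-learn's `make_classification`, and its size and parameters are arbitrary choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic labeled data: each row of X is an input, each entry of y its label.
X, y = make_classification(
    n_samples=500,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=0,
)

# Hold out 20% of the input-output pairs to play the role of unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Training: learn a function from inputs to labels using the labeled examples.
model = LogisticRegression()
model.fit(X_train, y_train)

# Prediction: apply the learned function to inputs the model has not seen.
y_pred = model.predict(X_test)
test_accuracy = model.score(X_test, y_test)

print(f"Accuracy on held-out data: {test_accuracy:.2f}")
```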

Examples of use cases in business:

1. **Predictive modeling**: Build predictive models that forecast future outcomes based on past data. This can be useful for predicting demand for a product, forecasting sales, or identifying potential risk in financial investments.
2. **Customer churn prediction**: Identify patterns in customer data that may indicate a high likelihood of churn (the loss of customers from a business or service). This can be useful for identifying at-risk customers and implementing retention strategies.
3. **Credit risk assessment**: Analyze financial data, such as credit history and income, to predict the likelihood of a borrower defaulting on a loan.
4. **Fraud detection**: Identify patterns in financial transactions that may indicate fraudulent activity.
5. **Sentiment analysis**: Classify text data, such as social media posts or customer reviews, as positive, negative, or neutral based on the sentiment expressed. This can be useful for understanding customer opinions or for monitoring the public perception of a company or product.

## 4. Unsupervised learning

Unsupervised learning is a type of machine learning where the algorithm is not given any labeled training data. Instead, the algorithm must discover the structure and patterns in the data through techniques such as clustering and dimensionality reduction.

One common application of unsupervised learning is clustering, where the goal is to group similar data points together into clusters. For example, a clustering algorithm could be used to group customers into different segments based on their purchase history.

Another application of unsupervised learning is dimensionality reduction, where the goal is to represent high-dimensional data in a lower-dimensional space while preserving as much of the original structure and patterns as possible. This can be useful for visualizing data or for reducing the number of features in a dataset before applying a supervised learning algorithm. Unsupervised learning is often used for tasks such as anomaly detection, density estimation, and data compression.
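
Both techniques can be sketched on synthetic data. The example below uses k-means for clustering and PCA for dimensionality reduction. These are common choices rather than the only ones, and the data (three blobs in five dimensions, generated with `make_blobs`) is made up for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic unlabeled data: 300 points in 5 dimensions.
# The true group labels returned by make_blobs are discarded on purpose.
X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)

# Clustering: assign each point to one of three clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

# Dimensionality reduction: project the 5-dimensional data onto 2 dimensions.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print("Points per cluster:", [int((cluster_labels == k).sum()) for k in range(3)])
print("Share of variance kept by 2 components:", pca.explained_variance_ratio_.sum())
```

The two-dimensional array `X_2d` could then be passed to a scatter plot, colored by `cluster_labels`, to inspect the groups visually.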

Examples of use cases in business:

1. **Customer segmentation**: Group customers into different segments based on their characteristics, behaviors, or preferences. This can be useful for targeted marketing or for tailoring products and services to different customer groups.
2. **Fraud detection**: Identify unusual patterns or behaviors in financial transactions that may indicate fraudulent activity.
3. **Market segmentation**: Group products or markets into different segments based on their characteristics or performance. This can be useful for identifying opportunities for growth or for targeting specific segments with different marketing strategies.
4. **Sentiment analysis**: Analyze text data, such as social media posts or customer reviews, to identify patterns or trends in the sentiment expressed. This can be useful for understanding customer opinions or for monitoring the public perception of a company or product.
5. **Supply chain optimization**: Identify patterns in supply chain data, such as demand patterns or supplier performance, to optimize logistics and improve efficiency.

## 5. Model selection, evaluation and assessment

Model selection refers to the process of choosing a machine learning model to use for a particular task, based on the characteristics of the data and the requirements of the task. This can involve comparing different types of models, such as linear regression, logistic regression, and support vector machines, to see which one performs the best.

Model evaluation refers to the process of measuring the performance of a machine learning model on a specific dataset. This typically involves splitting the data into a training set and a test set, training the model on the training set, and then evaluating the model's performance on the test set. Common metrics for classification models include accuracy, precision, and recall; for regression models, the mean squared error is a common choice.

Model assessment refers to the process of using a machine learning model to make predictions on new, unseen data. This can involve using the model to make predictions on a separate validation set, or using the model in a real-world application to see how well it performs. The goal of model assessment is to determine how well the model generalizes to new data, and to identify any potential issues or limitations.
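
One way to sketch these three steps with scikit-learn is shown below: candidate models are compared by cross-validation on the training data (selection), and the chosen model is then scored once on a held-out test set (evaluation and assessment). The synthetic data, the two candidate models, and the dictionary names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=8, random_state=0)

# Keep a test set aside; it is not touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Model selection: compare candidates with 5-fold cross-validation
# on the training data only.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(),
}
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)

# Refit the selected model on the full training set.
best_model = candidates[best_name].fit(X_train, y_train)

# Evaluation on unseen data: report several classification metrics.
y_pred = best_model.predict(X_test)
test_metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
}

print("Cross-validated accuracy:", cv_scores)
print("Selected model:", best_name)
print("Test metrics:", test_metrics)
```

Because the test set plays no part in choosing the model, the test metrics give a more honest estimate of how the selected model generalizes.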

## 6. Ensemble methods

Ensemble methods are techniques that involve training multiple models and then combining their predictions to make a final prediction. These methods are based on the idea that multiple models, each of which may be a weak learner on its own, can come together to form a strong overall model. There are several types of ensemble methods, including bagging (bootstrap aggregating), boosting, and stacking.

One of the main advantages of ensemble methods is that they can often improve the performance of a predictive model. This is because the individual models that make up the ensemble differ from one another, for example because they are trained on different subsets of the data (as in bagging), and may therefore make different types of errors. By combining their predictions, the ensemble can average out some of these errors and achieve a better overall performance. Additionally, ensemble methods can be used to reduce the variance of a model, making it more robust and less sensitive to small changes in the training data.

There are also some limitations to ensemble methods. One potential drawback is that they can be more computationally expensive to train, since multiple models need to be fit to the data. Additionally, the final prediction made by the ensemble may be more difficult to interpret compared to a single model, since it is based on the combination of multiple models.
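
To make the idea concrete, the sketch below compares a single decision tree with two tree-based ensembles from scikit-learn using cross-validation. The synthetic data and the choice of 50 estimators are arbitrary assumptions, and how large the difference is (or whether there is one) depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

models = {
    # A single model, for comparison.
    "single_tree": DecisionTreeClassifier(random_state=0),
    # Bagging: each tree is fit on a bootstrap sample of the data and the
    # predictions are combined. The default base model is a decision tree.
    "bagged_trees": BaggingClassifier(n_estimators=50, random_state=0),
    # Random forest: bagged trees that also consider a random subset of
    # features at each split, which makes the trees more diverse.
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# 5-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Training the two ensembles takes noticeably longer than training the single tree, which illustrates the computational cost mentioned above.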

## 7. Neural networks

1. **Structure of a neural network**: A neural network consists of layers of interconnected "neurons," which are inspired by the structure and function of neurons in the brain. Each neuron receives input from other neurons and produces an output based on a weighted sum of those inputs and an activation function. The output of one layer serves as the input to the next layer, and the layers are stacked to form a network.
2. **Activation functions**: Activation functions are used to introduce non-linearity into the neural network. Common activation functions include sigmoid, tanh, and ReLU.
3. **Training a neural network**: Neural networks are trained using backpropagation, which computes how the error between the predicted output and the true output labels changes with each weight and bias (the gradient). An optimization algorithm such as stochastic gradient descent then uses these gradients to adjust the weights and biases so that the error decreases.
4. **Overfitting and underfitting**: Overfitting occurs when a neural network is too complex and has too many parameters, which leads to poor generalization to new data. Underfitting occurs when the neural network is too simple and cannot capture the underlying patterns in the data.
5. **Convolutional neural networks (CNNs)**: CNNs are a type of neural network specifically designed for processing data with a grid-like structure, such as images. They are composed of convolutional layers, which apply a set of filters to the input data to extract features, and pooling layers, which down-sample the feature maps.
6. **Recurrent neural networks (RNNs)**: RNNs are a type of neural network that can process sequential data, such as time series or natural language. They are composed of recurrent layers, which maintain a state that can be passed from one time step to the next, and can be used for tasks such as language translation and language generation.
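
The first three points can be illustrated with a very small network written directly in NumPy. This is a teaching sketch under simplifying assumptions: one hidden layer of 8 tanh neurons, a sigmoid output, a cross-entropy loss, the four-point XOR problem as data, and plain full-batch gradient descent instead of stochastic gradient descent. The layer sizes, learning rate, and number of epochs are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the XOR problem, which a model without a hidden layer cannot fit.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# Parameters: weights and biases for the hidden layer and the output layer.
W1 = rng.normal(scale=1.0, size=(2, 8))
b1 = np.zeros(8)
W2 = rng.normal(scale=1.0, size=(8, 1))
b2 = np.zeros(1)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(inputs):
    """Weighted sums followed by activation functions, layer by layer."""
    hidden = np.tanh(inputs @ W1 + b1)   # hidden layer with tanh activation
    prob = sigmoid(hidden @ W2 + b2)     # output layer with sigmoid activation
    return hidden, prob


def cross_entropy(prob, target):
    eps = 1e-12  # avoids log(0)
    return float(
        -np.mean(target * np.log(prob + eps) + (1 - target) * np.log(1 - prob + eps))
    )


learning_rate = 0.5
losses = []
for epoch in range(2000):
    hidden, prob = forward(X)
    losses.append(cross_entropy(prob, y))

    # Backpropagation: apply the chain rule from the output back to the input.
    d_out = (prob - y) / len(X)                      # gradient at the output pre-activation
    grad_W2 = hidden.T @ d_out
    grad_b2 = d_out.sum(axis=0)
    d_hidden = (d_out @ W2.T) * (1.0 - hidden ** 2)  # tanh'(z) = 1 - tanh(z)^2
    grad_W1 = X.T @ d_hidden
    grad_b1 = d_hidden.sum(axis=0)

    # Gradient descent step: move each parameter against its gradient.
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2

_, final_prob = forward(X)
print(f"Loss: first epoch {losses[0]:.3f}, last epoch {losses[-1]:.3f}")
print("Predicted probabilities:", final_prob.ravel().round(2))
```

In practice a library such as Keras handles the gradient computation and the parameter updates, so this code only serves to show what happens under the hood.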

## 8. Applications

We will look at some applications in class, and you have to come up with your own applications for the mini project. Start early!

Some broad classes of applications:

1. **Image classification**: Use supervised learning techniques by training a model to classify images into different categories, such as animals or vehicles.
2. **Natural language processing**: Use NLP for processing and understanding text data, such as sentiment analysis or machine translation.
3. **Time series forecasting**: Use ML methods for predicting future values based on past data, such as stock price prediction or energy consumption forecasting.
4. **Recommender systems**: Use techniques for predicting user preferences, such as collaborative filtering or matrix factorization.
5. **Reinforcement learning**: Train a computer agent to play a game by learning from trial and error.
6. **Anomaly detection**: Use unsupervised learning techniques by training a model to identify unusual patterns in data, such as fraud detection or fault detection in a manufacturing process.


## 9. Scikit-learn

Webpage: [scikit-learn](https://scikit-learn.org/stable/index.html)

Scikit-learn, also known as sklearn, is a free, open-source machine learning library for Python that provides a range of algorithms and tools for tasks such as classification, regression, clustering, and dimensionality reduction. It is built on top of the popular scientific computing libraries NumPy and SciPy, and it is designed to be easy to use and integrate well with other Python libraries, such as pandas and matplotlib.

One of the key features of scikit-learn is its uniform interface, which allows users to easily switch between different algorithms and use them in a consistent way. The library also includes a number of convenience functions, such as train_test_split for dividing data into training and test sets, and cross_val_score for evaluating models using cross-validation.
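
A short sketch of what the uniform interface means in practice: the loop below fits and scores three different classifiers with exactly the same calls (`fit`, `score`, and `cross_val_score`). The choice of the iris dataset and of these three estimators is only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# train_test_split: divide the data into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

estimators = [
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(),
    DecisionTreeClassifier(random_state=0),
]

results = {}
for estimator in estimators:
    # Every estimator exposes the same fit / predict / score methods.
    estimator.fit(X_train, y_train)
    results[type(estimator).__name__] = {
        "test_accuracy": estimator.score(X_test, y_test),
        # cross_val_score: refit and score the estimator on 5 different folds.
        "cv_accuracy": cross_val_score(estimator, X, y, cv=5).mean(),
    }

for name, result in results.items():
    print(name, result)
```

Swapping in another algorithm only requires changing the entries of the `estimators` list; the rest of the code stays the same.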

Scikit-learn is widely used in the field of machine learning, and it is particularly popular for tasks such as classification, regression, and clustering. It is also commonly used in data science and scientific computing, and it has been cited in many academic papers and articles. The library is actively maintained and developed, and it has a strong community of users and contributors.

## 10. Keras

Webpage: [Keras](https://keras.io)

Keras is a user-friendly, high-level library for building and training neural networks in Python. It is built on top of other popular machine learning libraries, such as TensorFlow and PyTorch, and it provides a range of functions and classes for defining and training neural networks.

One of the key features of Keras is its simplicity and ease of use. It has a clear and intuitive API, which makes it easy for users to build and train neural networks with minimal code. Keras also has a range of built-in functions and tools for tasks such as data preprocessing and model evaluation, which can save time and make it easier to get up and running with neural networks.

In addition to its simplicity and ease of use, Keras is also highly flexible and modular. It allows us to define custom neural network architectures and to use a variety of different optimization algorithms and loss functions. It also supports both convolutional and recurrent neural networks, which makes it suitable for a wide range of tasks, including image classification, natural language processing, and time series analysis.

Overall, Keras is a popular choice for building and training neural networks in Python, and it is widely used in the machine learning community. Its user-friendly interface and range of features make it a good option for beginners and experienced users alike.