# ML Ops Module


# 🚗 Welcome to WagonCab 🚗

# But what is an ML Engineer, exactly?

A Machine Learning (ML) Engineer is a professional who is involved in the development, deployment, and maintenance of machine learning systems. They combine knowledge of software engineering and machine learning to create ML models and systems that can be used to provide data-driven solutions. 

Some key responsibilities of an ML engineer are:

1. **Data Analysis**: ML engineers work with large amounts of data. They are responsible for cleaning, processing, and analyzing this data to extract meaningful insights that can be used to improve machine learning models.

2. **Model Development**: ML engineers use various machine learning algorithms and models to build ML systems. They select the appropriate model, train it using the available data, and then evaluate its performance.

3. **Software Engineering**: Unlike data scientists, who are primarily focused on the development of models, ML engineers also need to have strong software engineering skills. They need to be proficient in various programming languages such as Python, Java, or R, and they often need to write production-level code.

4. **System Design**: ML engineers design and build machine learning pipelines, which involve data ingestion, data transformation, model training, and model deployment. They must ensure that these systems are scalable and robust.

5. **Testing and Validation**: ML engineers are also responsible for validating the results of machine learning models. They ensure that the models are performing as expected and are reliable and secure.

6. **Deployment and Maintenance**: After the development and testing phase, ML engineers deploy the machine learning models into the production environment. They then monitor the performance of these models over time, fine-tune them as necessary, and maintain the overall system.

7. **Communication**: ML engineers often need to collaborate with other teams in the organization, including data scientists, software engineers, product managers, and sometimes even stakeholders. They need to be able to explain complex concepts in a way that non-technical team members can understand.

In terms of skills, an ML engineer would typically have a strong background in computer science and mathematics, along with expertise in machine learning algorithms and principles, proficiency in programming languages like Python, experience with big data platforms and tools, and familiarity with machine learning frameworks like TensorFlow or PyTorch. 

It's also increasingly common for ML engineers to have knowledge of Machine Learning Operations (MLOps), a practice that brings together ML, DevOps, and data engineering to automate and improve the process of machine learning model development.

A data scientist, on the other hand, is a professional who uses their knowledge in statistics, data analysis, and machine learning to extract insights and knowledge from complex and large volumes of data. The role of a data scientist can vary significantly across different organizations, but here are some common responsibilities:

1. **Data Analysis**: A major part of a data scientist's job is to analyze data to extract insights. This can involve data preprocessing, cleaning, and feature engineering to prepare data for analysis. They also interpret the data and provide insights that can be used for decision-making.

2. **Model Development and Machine Learning**: Data scientists create predictive models using machine learning algorithms. They select appropriate models, train them using available data, evaluate their performance, and fine-tune them to improve their accuracy.

3. **Data Visualization**: They present data in a graphical format that makes the information easy to understand. This could be for internal use, to help stakeholders understand the data and the insights derived from it.

4. **Communication**: Data scientists need to communicate complex data insights to non-technical team members and stakeholders. This can involve explaining the data, the methodology used, the conclusions drawn, and the implications for the organization.

5. **Decision-Making**: Based on the insights drawn from data, data scientists advise on decision-making processes in various business areas, such as product development, marketing strategy, or operational efficiency.

In terms of skills, data scientists typically have a strong background in mathematics and statistics, expertise in a programming language (commonly Python or R), experience with databases and SQL, knowledge of machine learning algorithms and principles, and proficiency in data visualization tools and techniques.

While there is some overlap between data scientists and ML engineers, data scientists usually focus more on analysis and insight extraction, and they often work with smaller datasets and put less emphasis on production code. ML engineers, by contrast, typically work on deploying machine learning models at scale, often dealing with larger datasets and production systems.


# ML Infrastructure

Machine learning infrastructure refers to the tools, systems, and technologies that are used to develop, deploy, and maintain machine learning models in a scalable and robust manner. It is the backbone that supports the end-to-end lifecycle of a machine learning project, from data collection and preprocessing, to training and serving models, to monitoring model performance. 

Here is what each of those steps looks like in the context of a typical machine learning project:

1. **Data Transformation**: Before a machine learning model can be trained, the data used for training must be properly formatted and cleaned. This process is known as data transformation, and it often includes steps like:

   - **Data Cleaning**: Removing or correcting erroneous data, dealing with missing values, and eliminating duplicates.
   
   - **Feature Engineering**: Creating new features from existing ones to better represent the underlying patterns in the data. This might involve tasks such as one-hot encoding for categorical variables or normalization and standardization of numerical variables.
   
   - **Data Splitting**: Dividing the dataset into training, validation, and test sets. The training set is used to train the model, the validation set is used to tune hyperparameters and make decisions on the model design, and the test set is used to evaluate the model's performance on unseen data.

2. **Model Training and Development**: Once the data is ready, a machine learning model can be trained. This involves choosing an appropriate algorithm (like linear regression, decision trees, neural networks, etc.), and feeding the training data into this algorithm. The algorithm then 'learns' from this data by adjusting its parameters to minimize the discrepancy between its predictions and the actual values. This discrepancy is measured by a loss function, which the model aims to minimize.

3. **Model Inference**: After a model has been trained, it can be used to make predictions on new, unseen data. This process is called model inference. For instance, if you've trained a model to predict house prices based on features like size, location, and number of bedrooms, you could use this trained model to predict the price of a house that wasn't in your original training data.

4. **Integration**: Once a model has been trained and validated, it needs to be integrated into the larger system or application where it will be used. This could involve deploying the model to a server, setting up an API for other services to use the model, or integrating the model into a software application. This step also involves ensuring that the model can handle the scale of requests it will receive in a production environment, and setting up processes to monitor the model's performance and update or retrain the model as needed.

Each of these steps is crucial to creating a useful machine learning model. Mistakes or oversights at any step can significantly affect the model's performance, so each one needs to be carried out carefully and thoroughly.
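The four steps above can be sketched end to end in a few lines. This is a minimal illustration, not a production pipeline: the data is synthetic (an assumed relationship between house size and price), the model is a one-variable least-squares fit, and names such as `fit`, `mse`, and `predict` are hypothetical helpers written for this example.

```python
import random

# Synthetic data (an assumption for illustration): price = 3000 * size + 50000, plus noise.
random.seed(0)
raw = [(size, 3000.0 * size + 50_000.0 + random.gauss(0, 5_000)) for size in range(40, 200, 2)]
raw.append((None, 123_456.0))  # a deliberately broken row, to be cleaned out

# 1. Data transformation: drop rows with missing values, then split into train and test sets.
clean = [(size, price) for size, price in raw if size is not None and price is not None]
random.shuffle(clean)
cut = int(0.8 * len(clean))
train_rows, test_rows = clean[:cut], clean[cut:]


# 2. Model training: closed-form least squares for y = a * x + b,
#    which minimizes the mean squared error (the loss) on the training set.
def fit(rows):
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    var = sum((x - mean_x) ** 2 for x, _ in rows)
    a = cov / var
    return a, mean_y - a * mean_x


def mse(rows, a, b):
    """Mean squared error of the line y = a * x + b on the given rows."""
    return sum((a * x + b - y) ** 2 for x, y in rows) / len(rows)


slope, intercept = fit(train_rows)


# 3. Model inference: predict the price of a house that was not in the training data.
def predict(size):
    return slope * size + intercept


# 4. Integration: in production, `predict` would sit behind an API and its
#    error on fresh data would be monitored over time.
print(f"test MSE: {mse(test_rows, slope, intercept):.0f}")
print(f"predicted price for size 100: {predict(100):.0f}")
```

The test-set error is computed on rows the model never saw during training, which is why the split happens before `fit` is called.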



Creating a robust machine learning infrastructure involves a combination of many or all of these components, and it often involves making trade-offs based on the specific needs and constraints of the organization or project.


# 🚗 Back to WagonCab 🚗

# Pyenv vs Venv

`pyenv` and `venv` are both popular tools in the Python community, but they serve slightly different purposes:

1. **pyenv**: is used for managing multiple versions of Python on a single machine. You might use pyenv if, for example, you have some projects that require Python 3.7 and others that require Python 3.8. With pyenv, you can easily switch between these Python versions on a per-project basis.

   Pyenv does not inherently isolate Python packages between environments; it's just about managing Python versions. However, an additional tool called `pyenv-virtualenv` can be used in conjunction with pyenv to manage both Python versions and dependencies in isolation.

2. **venv**: is a module, included in Python 3.3 and later, used to create isolated Python environments. It lets you create an environment where you can install Python packages without affecting other environments or your system Python. This is useful when you have different projects that require different packages or versions of packages.

So basically:

- If you need to switch between multiple versions of Python, you might want to use pyenv.
- If you need to isolate Python packages to avoid conflicts between different projects, you can use venv.
- If you need to do both, you might consider using pyenv with the pyenv-virtualenv plugin.
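To check which of these tools is in effect at a given moment, you can ask the running interpreter directly. The sketch below uses only the standard library; `describe_interpreter` is a hypothetical helper name, and the `in_virtualenv` check assumes an environment created with `venv` or a recent `virtualenv`.

```python
import sys


def describe_interpreter() -> dict:
    """Summarize which Python is running and whether it is inside a virtual environment."""
    return {
        # The version selected by pyenv (or by the system) shows up here.
        "version": ".".join(map(str, sys.version_info[:3])),
        "executable": sys.executable,
        # Inside a venv, sys.prefix points at the environment, while
        # sys.base_prefix still points at the Python it was created from.
        "in_virtualenv": sys.prefix != sys.base_prefix,
    }


info = describe_interpreter()
for key, value in info.items():
    print(f"{key}: {value}")
```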

It's important to note that there are many tools available for managing Python environments, and the best one for you depends on your specific use case. Other popular options include virtualenv, pipenv, and conda.

# 🚗 Back to WagonCab 🚗 - read

# WagonCab has the huge, public

# Your goal as an ML engineer will be to:

# 3 Unit 1) Train at Scale

# 1) Packaging & Virtual Env 101

In Python, a package is a way to organize and structure related modules (Python files) and sub-packages (nested packages) in a hierarchical manner. It allows you to group related functionality together, making it easier to manage and organize your code.

A package is essentially a directory (or folder) that contains a special file called `__init__.py`. This file serves as an indicator that the directory should be treated as a package. The `__init__.py` file can be empty, or it can contain initialization code that is executed when the package is imported.

By organizing your code into packages, you can create a modular and reusable structure for your Python projects. It helps avoid naming conflicts, provides a clear namespace, and allows for better organization and maintainability of code.

To use a package in Python, you typically import the specific modules or sub-packages you need from the package using the import statement. 
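As a minimal sketch of the idea, the snippet below builds a tiny package in a temporary directory and imports a module from it. The names `demo_pkg`, `greetings`, and `hello` are hypothetical and exist only for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # A package is a directory that contains an __init__.py file.
    pkg_dir = Path(tmp) / "demo_pkg"
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text("")
    (pkg_dir / "greetings.py").write_text(
        "def hello(name):\n    return f'Hello, {name}!'\n"
    )

    # Make the parent directory importable, then import the sub-module.
    sys.path.insert(0, tmp)
    try:
        greetings = importlib.import_module("demo_pkg.greetings")
        message = greetings.hello("WagonCab")
    finally:
        sys.path.remove(tmp)

print(message)
```

In a real project you would not touch `sys.path` by hand: installing the package (for example with `pip install -e .`) makes it importable from anywhere in the environment.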

You can also share a package with others. Popular distribution methods include:

- PyPI (Python Package Index): PyPI is the official repository for Python packages. You can upload your package to PyPI, making it easily accessible to other Python users who can install it using tools like pip. To publish your package on PyPI, you'll need to create a setup.py file that describes your package and its dependencies.

- GitHub: Host your package on a version control platform like GitHub. This allows users to clone or download your package directly from the repository. You can also provide installation instructions and documentation in the repository's README file.

# Anatomy of a Minimal Python Package

**Running as a Python file vs. as a Python module**

The main difference between running Python code as a module and as a file is in how the code is located and used. A module is found through the import system, so other code can import it and reuse its functionality, and `python -m package.module` runs it with its package context intact. Executing a file (`python path/to/file.py`) runs the code directly from that path, which makes it suitable for standalone scripts or command-line tools.
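One visible consequence of this difference is the value of `__name__`, which is what the `if __name__ == '__main__':` idiom relies on. The sketch below runs the same source both ways; the file name `whoami_demo.py` and its contents are hypothetical.

```python
import importlib
import runpy
import sys
import tempfile
from pathlib import Path

SOURCE = '''
def describe():
    return f"__name__ is {__name__!r}"

RESULT = describe()
'''

with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "whoami_demo.py"
    script.write_text(SOURCE)

    # Equivalent to `python whoami_demo.py`: the code runs as the main program.
    as_file = runpy.run_path(str(script), run_name="__main__")["RESULT"]

    # Equivalent to `import whoami_demo` from other code.
    sys.path.insert(0, tmp)
    try:
        as_module = importlib.import_module("whoami_demo").RESULT
    finally:
        sys.path.remove(tmp)

print("run as a file:     ", as_file)
print("imported as module:", as_module)
```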

# 1.2) Virtual Environments

Virtual environments are isolated Python environments that allow you to manage and control the dependencies and packages used by your Python projects. They provide a way to create an isolated environment where you can install specific versions of Python packages without interfering with the system-wide Python installation or other projects.

They are useful when you need to work on multiple projects with different package requirements or when you want to avoid conflicts between different versions of packages.
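Virtual environments are usually created from the command line (`python -m venv <dir>`), but the same thing can be done from Python with the standard-library `venv` module. The sketch below creates a throwaway environment and lists what it contains; the directory name `demo_env` is hypothetical, and `with_pip=False` is chosen only to keep the example fast and offline.

```python
import tempfile
import venv
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "demo_env"

    # Create an isolated environment without installing pip into it.
    venv.create(env_dir, with_pip=False)

    # pyvenv.cfg records which base Python the environment was created from.
    config_text = (env_dir / "pyvenv.cfg").read_text()
    created = sorted(p.name for p in env_dir.iterdir())

print(created)
print(config_text)
```

The environment has its own `site-packages` directory, which is why packages installed in it do not leak into other projects.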

# Reminder on pyenv vs. venv

In a nutshell:
    
   Pyenv is primarily used for managing different versions of Python globally on your system, while venv is used to create isolated virtual environments specific to individual projects, allowing you to manage project-specific dependencies and packages. Both tools serve different purposes and can be used together in Python development workflows.

# 2) 💻 Installing & Using a Package

# 2.1) Install the Package

The `setup.py` file is a script used in Python packages to define the metadata and dependencies of the package, as well as to specify the installation process. It is a crucial component when distributing and installing Python packages with a package manager such as pip.

The primary purpose of the setup.py file is to provide information about the package to the package management tools and to automate the installation process.
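A minimal `setup.py` could look like the sketch below. Only the package name `toto` comes from this module's examples; the version, description, and empty dependency list are placeholder assumptions to adapt to your own project.

```python
# setup.py (a sketch: every value below is a placeholder to adapt)
from setuptools import find_packages, setup

setup(
    name="toto",                 # package name used in this module's examples
    version="0.0.1",             # assumed version number
    description="Demo package",  # assumed description
    packages=find_packages(),    # discovers directories that contain an __init__.py
    install_requires=[],         # runtime dependencies would be listed here
)
```

With this file at the root of the project, `pip install -e .` installs the package in editable mode, so changes to the source are picked up without reinstalling.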

# 2.2) Run the Package from Anywhere (when totoenv is Activated)

# For instance, from a notebook

# ❓ Why Does it Work?

# site-packages contains all your pip packages.

The "site-packages" folder in Python is a location where third-party packages are installed by default. It is a directory where Python searches for and imports additional packages or modules that are not part of the standard library or the Python interpreter itself.

When you install a third-party package using tools like pip, the package and its associated files (including modules, libraries, and other resources) are typically installed in the "site-packages" directory. This allows Python to locate and import those packages when they are needed in your code.
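You can ask Python where these directories are. The sketch below uses the standard-library `sysconfig` module to print the `site-packages` location of the current interpreter, and `importlib.util.find_spec` to show where a standard-library module (`json`, picked arbitrarily) is loaded from.

```python
import importlib.util
import sysconfig

paths = sysconfig.get_paths()

# Where pip installs pure-Python third-party packages for this interpreter.
site_packages = paths["purelib"]

# Where the standard library lives, for comparison.
stdlib_dir = paths["stdlib"]

# find_spec reports the file a module would be imported from.
json_spec = importlib.util.find_spec("json")

print("site-packages:", site_packages)
print("standard lib: ", stdlib_dir)
print("json comes from:", json_spec.origin)
```

Running this once with `totoenv` activated and once without shows that each environment has its own `site-packages` directory.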

# 2.3) Dependencies

# 2.4) Adding a Makefile to Create Simple CLI Commands ⚡️

A Makefile is a special type of file used in software development projects to define and automate the build process. It contains a set of rules and instructions that specify how to compile source code, link dependencies, generate executables or libraries, and perform other build-related tasks.

# 3) Testing your Package 🧪

Testing plays a crucial role in software development, and there are several reasons why it is important to incorporate testing into the development process:
- Identifying bugs and errors
- Ensuring functionality
- Maintaining code quality
- Enabling refactoring and code changes
- Improving software maintainability


TDD stands for Test-Driven Development. It is a software development methodology that emphasizes writing tests before writing the actual code. The TDD approach follows a cycle of "Red-Green-Refactor" and involves the following steps:
- Write a test.
- Run the test and watch it fail (Red).
- Write just enough code to make the test pass (Green).
- Run the test again to confirm it passes.
- Refactor: with the test passing, you can refactor the code to improve its design, structure, or performance without changing its behavior. The aim is to maintain or enhance the code's quality while keeping the tests passing.
- Repeat: the cycle repeats for each new functionality or change. You continue writing new tests, implementing the necessary code to pass them, and then refactoring as needed.

Let's code an example together.

1. Create `toto/divide.py`
2. Create a `tests` folder
3. Create `tests/test_divide.py`
4. Paste the code from the slide (an empty function)
5. Run `pip install pytest`
6. Run `pytest`
7. Run `pytest -v tests/test_divide.py`

In [2]:
# Create divide.py

def divide_without_raising(x:float, y:float) -> float:
    '''
    divides x by y, but instead of raising errors when y equals 0, returns:
    - inf if x positive
    - -inf if x negative
    - nan if x equals 0
    '''
    return x/y

But when x = 2.0 and y = 0.0, it raises a `ZeroDivisionError`: the function breaks before the assertion is even reached, because it never checks whether y == 0. Let's take care of that.

In [None]:
if y != 0.:
    return x/y
else:
    if x > 0.:
        return float('inf')

But now it doesn't work for x = -1. Let's fix that!

In [None]:
    if x < 0.:
        return -1 * float('inf')
    if x == 0.:
        return float('nan')

Now our tests pass, and that's the idea of TDD.
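The tests pasted from the slide are not reproduced in these notes, so the block below is only a sketch of what `tests/test_divide.py` might contain, one test per case in the docstring. The test names are hypothetical, and the function is repeated in compact form so the block runs on its own.

```python
import math


def divide_without_raising(x: float, y: float) -> float:
    """Divide x by y, returning inf, -inf or nan instead of raising when y is 0."""
    if y != 0.0:
        return x / y
    if x > 0.0:
        return float("inf")
    if x < 0.0:
        return float("-inf")
    return float("nan")


# In the project, these tests would live in tests/test_divide.py and start with
# `from toto.divide import divide_without_raising` instead of redefining it.
def test_regular_division():
    assert divide_without_raising(6.0, 3.0) == 2.0


def test_positive_numerator_zero_denominator():
    assert divide_without_raising(2.0, 0.0) == float("inf")


def test_negative_numerator_zero_denominator():
    assert divide_without_raising(-1.0, 0.0) == float("-inf")


def test_zero_over_zero():
    # nan never compares equal to itself, so check it with math.isnan.
    assert math.isnan(divide_without_raising(0.0, 0.0))
```

pytest collects every function whose name starts with `test_`, so `pytest -v tests/test_divide.py` reports each case on its own line.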

Now we can add a `make test` command to our Makefile.

Verbose output (the `-v` flag in `pytest -v`) refers to additional detailed information displayed during the execution of a program or command. In the context of software development and testing, verbose output provides more extensive feedback, often including additional details, debug information, or progress updates.

# Data engineering tips

### Become a debugging master

Debugging in data engineering often involves tracing errors in data pipelines, fixing data quality issues, and resolving performance bottlenecks. Here are some tips for debugging in the context of data engineering:

1. **Understand Your Data:** Always know what your data looks like, its format, and how it should behave. Check data samples regularly to ensure consistency. Using data profiling or data quality tools can help in this understanding. 

2. **Thorough Logging:** Log as much information as you can during your data pipeline processes. Include details like timestamps, data size, data source, data destination, and any errors or warnings. These logs will be invaluable when diagnosing problems.

3. **Use Monitoring Tools:** There are many tools available for monitoring data pipelines and jobs, such as Apache NiFi, AWS CloudWatch, and Google Cloud Monitoring (formerly Stackdriver). These tools help with real-time debugging and with alerting on issues in your data pipelines.

4. **Unit Testing:** Test each component of your pipeline individually to ensure it functions as expected. This will make debugging easier, as you can isolate issues to specific units.

5. **Data Validation:** Validate your data at every stage of the pipeline. Check for nulls, incorrect formats, unexpected values, and other anomalies that could cause errors downstream.

6. **Performance Metrics:** Monitor the performance metrics of your pipelines like run-time, resource usage, and throughput. This can help identify bottlenecks or performance degradation over time.

7. **Replicate Issues Locally:** If you're debugging a complex issue, try to replicate it in a local or staging environment. This allows you to experiment and solve the problem without affecting the production environment.

8. **Version Control:** Keep your pipeline scripts/code in version control systems like Git. This allows you to track changes over time, understand what changed when a bug appeared, and easily roll back changes if needed.

9. **Incremental Testing:** When developing new features in your data pipeline, test incrementally. Don't wait until everything is built to begin testing. Test after adding each new feature or component.

10. **Document Your Pipelines:** Document your data pipelines, including each component and how data flows through them. This documentation will be incredibly useful when you're trying to understand how to debug a problem.

Remember, debugging is often about patience and systematic exploration. Take your time, be thorough, and you'll find the problem.
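Two of the tips above, thorough logging and data validation, can be combined in a single pipeline step. The sketch below is a hedged illustration using only the standard-library `logging` module; the function `clean_rides` and the column names `fare_amount` and `passenger_count` are hypothetical.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("pipeline.clean_rides")


def clean_rides(rows: list[dict]) -> list[dict]:
    """Keep rows with a positive fare and a passenger count, and log what was dropped."""
    logger.info("received %d rows", len(rows))
    valid, rejected = [], []
    for row in rows:
        fare = row.get("fare_amount")
        passengers = row.get("passenger_count")
        # Validation: reject missing values and non-positive fares.
        if fare is None or passengers is None or fare <= 0:
            rejected.append(row)
        else:
            valid.append(row)
    if rejected:
        logger.warning(
            "dropped %d invalid rows, first example: %r", len(rejected), rejected[0]
        )
    logger.info("returning %d valid rows", len(valid))
    return valid


sample = [
    {"fare_amount": 12.5, "passenger_count": 2},
    {"fare_amount": -3.0, "passenger_count": 1},  # negative fare
    {"fare_amount": 8.0},                          # missing passenger_count
]
cleaned = clean_rides(sample)
```

Logging the row counts before and after each step makes it much easier to see where data went missing when a downstream result looks wrong.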

In [5]:
# divide.py

from toto.lib import who_am_i

def divide_without_raising(x:float, y:float) -> float:
    '''
    divides x by y, but instead of raising errors when y equals 0, returns:
    - inf if x positive
    - -inf if x negative
    - nan if x equals 0
    '''
    who_am_i()
        
    if y != 0.:
        return x/y
    else:
        if x > 0.:
            return float('inf')
        if x < 0.:
            return -1 * float('inf')
        if x == 0.:
            return float('nan')
        
if __name__ == '__main__':
    divide_without_raising(2., 1)

In [None]:
python toto/divide.py

So this is a stack trace: a report showing the nested subroutine calls made by a program at a specific point in time. Stack traces are typically used for debugging, to track the sequence of nested function calls that led up to the moment the stack trace was generated. Python prints the most recent call last.
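To see how a stack trace maps onto nested calls, the sketch below triggers an error three calls deep and captures the trace as text with the standard-library `traceback` module. The function names `outer`, `middle`, and `inner` are hypothetical.

```python
import traceback


def inner():
    return 1 / 0  # the error happens here


def middle():
    return inner()


def outer():
    return middle()


try:
    outer()
except ZeroDivisionError:
    # format_exc() returns the same text Python would print for an uncaught error.
    trace_text = traceback.format_exc()

print(trace_text)
```

Reading the output from the bottom up, the last line names the exception, the frame just above it is `inner` where the error occurred, and the frames above that show how the program got there.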

We can also set a trace manually, as a breakpoint that makes your code purposely pause at a certain point.
1. `pip install ipdb`
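A breakpoint can be set with the built-in `breakpoint()` function, as sketched below. By default it opens the standard `pdb` debugger; setting the environment variable `PYTHONBREAKPOINT=ipdb.set_trace` makes the same call open `ipdb` instead. The function `average_fare` is a hypothetical example.

```python
def average_fare(fares: list[float]) -> float:
    """Return the mean fare, pausing in the debugger just before the division."""
    total = sum(fares)
    # Execution pauses here and a debugger prompt opens.
    # PYTHONBREAKPOINT=ipdb.set_trace switches the debugger to ipdb,
    # and PYTHONBREAKPOINT=0 turns every breakpoint() call into a no-op.
    breakpoint()
    return total / len(fares)
```

Calling `average_fare([10.0, 12.5])` from a terminal drops you into the debugger, where `p total` prints a variable, `n` runs the next line, `c` continues execution, and `q` quits.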

- Add `try` / `except`:
1. The try block contains code that might potentially raise an exception. When an exception is raised, the rest of the code in the try block is skipped.

2. After the try block, one or more except blocks catch and handle exceptions. Each except block specifies the type of exception it can handle. When an exception is raised in the try block, Python looks for an except block that handles this type of exception, and then executes that except block.

In [6]:
try:
    # This code may raise an exception
    x = 1 / 0
except ZeroDivisionError:
    # This code is executed if a ZeroDivisionError is raised
    print("You can't divide by zero!")
    
# In this example, the try block contains code that raises a ZeroDivisionError.
# Since the try block can't complete successfully, Python raises an exception.
# Then it finds the except block that handles ZeroDivisionError, and executes
# the code in that block, printing the message "You can't divide by zero!".

You can't divide by zero!


# Master your IDE

# VS Code Shortcuts (macOS)

# Notebooks

...