This is a personal playground repository to practice machine learning and data science techniques and algorithms using the most popular Python libraries for the job. I use this repo to collect and organise notes for a quick reference of things I learned during my studies.
The notebooks are written using Jupyter on my local machine. You can also use Kaggle or Google Colab to edit them in the cloud.
- Notice
- Credits
- Python prerequisites
- Getting started with PIP and Virtualenv
- Getting started with Anaconda
- Dependencies
- Tech Radar
- How to generate the requirements file
- How to upgrade the dependencies
- Useful PIP commands
- References
📚 The resources contained in this repository are notes written down while studying topics of Machine Learning and Data Science and exercises I wrote to practice. They reflect my understanding of the topic and as such, they are not meant to be used as an authoritative source of information and/or reference documentation about the subject they refer to.
🗒️ I take notes by summarising the concepts I read and/or watch. I also put together pieces from different materials I consult about a specific topic. I write stuff to conceptualise my own understanding of a broader topic or my thoughts about it. I even attempt to jot down ideas.
📦 I've built this repository mainly for myself, to have a place to collect my notes and to practice. I made it public because it doesn't hurt to give other people access to it, in the hope it may be useful to someone else in the process.
⛔ Any mistake, blunder, typo or inaccuracy - if present - is there in good faith. No hard feelings. If you care enough about this work though, please report any of the above by raising an issue so that I can fix it. Even better, you can raise a Pull Request if you'd like to propose a fix yourself.
❤️ If you feel grateful for this collection and feel that this work may have helped you even slightly, in any shape or form, with your study, consider crediting this repo as a way to say thank you 😊
Machine Learning Playground by Simone Spaccarotella is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
I first became interested in topics gravitating around the Artificial Intelligence world during my academic studies at the Università della Calabria, where I studied subjects like Intelligent Systems, Answer-Set Programming and this thing called Data Mining at the Department of Mathematics and Informatics.
I'm currently attending a Level 7 staff apprenticeship in AI and Data Science with Cambridge Spark at the BBC.
I'm also expanding on these concepts and beyond by reading and watching further material available on the internet (papers, resources, videos).
I'd like to give a shout-out to StatQuest with Josh Starmer. It's an excellent YouTube channel with a vast and "clearly explained" catalogue of concepts spanning from Statistics to Data Science and Machine Learning. For example, I can personally say that I'm now able to grasp the main concepts behind Encoder-Decoder and Transformer architectures thanks to Josh. We don't know each other, nor am I compensated to say this, so please, "check out the quest" and subscribe; it's worth your time.
I also attended courses on LinkedIn Learning, Pluralsight and Coursera, as well as training courses provided by the BBC, spanning from introductions to ML, to TensorFlow and Keras, to data manipulation and visualisation with Pandas, Matplotlib and Seaborn.
Make sure you have a suitable stable version of Python 3.x and `pip` installed on your machine. Consider using Pyenv to manage your Python versions. The desired Python version can be installed by running `pyenv install 3.n.m`, where `n` and `m` are the minor and patch versions respectively. If you are not sure which version to install, you can list the available ones by running `pyenv install --list`.

Please read the Python Version section to check the latest Python version compatible with the installed packages.
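The Pyenv workflow above can be sketched as a short shell session (it assumes `pyenv` is already installed, and the version number is just an example; pick whatever suits your setup):

```shell
# List the Python versions available for installation
pyenv install --list

# Install a specific version, e.g. 3.12.4
pyenv install 3.12.4

# Use that version for the current directory only
pyenv local 3.12.4
```

`pyenv local` writes a `.python-version` file in the current folder, so the chosen version is picked up automatically next time you `cd` into the project.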
Install Virtualenv to set up an isolated virtual environment in which to manage the Python project and its dependencies (read this installation guide to learn how).
You need to create a virtual environment with a clean installation of Python. The following command does so by creating a folder called `ml` (which is automatically excluded from revision control) containing a vanilla installation of Python with just the initial dependencies installed.
```shell
# Create a virtual environment (only if you don't have an "ml" folder yet)
virtualenv ml

# Enable the virtual environment
source ml/bin/activate

# Check that the python interpreter in use is the one from the virtual environment
which python

# Install the required dependencies
pip install -r requirements.txt

# Download the English pipeline for spaCy
python -m spacy download en_core_web_md

# Start the development environment
jupyter lab
```
NOTE: remember to deactivate the virtual environment by running the `deactivate` command once you're finished or when you switch project. If you don't, and you run `python` in another project through the same terminal session, you'll still be using the same local version of Python, with dependencies you may not want or need.
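Besides `which python`, you can also check from within Python itself whether the interpreter is running inside a virtual environment; a minimal sketch:

```python
import sys

def in_virtualenv() -> bool:
    # Inside a virtual environment, sys.prefix points at the venv folder,
    # while sys.base_prefix still points at the base Python installation.
    # Outside any venv, the two are equal.
    return sys.prefix != sys.base_prefix

print(in_virtualenv())
```

Running it after `source ml/bin/activate` should print `True`; after `deactivate` it should print `False`.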
- Download and install Anaconda on your machine
- Start the Anaconda Navigator
- Install and launch the Jupyter notebook or JupyterLab from the "home" tab
This is the list of the main DS libraries included in the `requirements.txt` file.
- Jupyter
- JupyterLab
- Jupyter Widgets
- NumPy
- SciPy
- SymPy
- Statsmodels
- Pandas
- Polars
- Dask
- Dask ML
- Matplotlib
- Seaborn
- Plotly
- Scikit-Learn
- TensorFlow
- Keras
- PyTorch
- XGBoost
- LightGBM and LightGBM Python Package
- CatBoost
- Prophet
- AWS Wrangler
- Sagemaker
- PySpark
- Optuna
- Imbalanced Learn
- SHAP
- PyWhy
- SpaCy
- NLTK
- Gensim
- Hugging Face Transformers
- Hugging Face Diffusers
- MLFlow
- AutoKeras
The full list of dependencies directly installed via PIP is the following:

```shell
pip install flake8 black isort split-folders rdflib notebook jupyterlab ipywidgets voila numpy scipy sympy statsmodels pandas polars 'dask[complete]' distributed 'dask-ml[complete]' ydata-profiling sweetviz autoviz lux matplotlib seaborn plotly scikit-learn tensorflow tensorflow_datasets keras-tuner torch torchvision torchaudio xgboost lightgbm catboost prophet awswrangler sagemaker pyspark pyarrow optuna imbalanced-learn category_encoders shap lime anchor-exp dowhy econml causal-learn spacy gensim nltk lightfm transformers 'diffusers[torch]' mlflow autokeras
```
Read the TensorFlow Software Requirements to check the latest compatible Python version.
Technology worth investigating:
If you want to generate a new "requirements" file, or you have added/removed dependencies and want to update the existing one, run:

```shell
pip freeze > requirements.txt
```
To upgrade the dependencies, we first replace all `==` symbols in the `requirements.txt` file with `>=`, so that we unlock the versions and allow PIP to download the latest ones. We then run the upgrade command, and finally freeze the packages again with the `==` symbol, to lock the latest versions.
```shell
# Unlock the current versions (BSD/macOS sed; with GNU sed drop the empty '' argument)
sed -i '' 's/[~=]=/>=/' requirements.txt

# Upgrade to the latest versions
pip install --upgrade -r requirements.txt

# Lock the latest versions
pip freeze > requirements.txt
```
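To see the unlock step in isolation, here is a tiny self-contained demo that runs the same `sed` substitution against a throwaway file (the package names and versions are made up):

```shell
# Create a throwaway requirements file with pinned versions
printf 'numpy==1.26.4\npandas~=2.2.0\n' > /tmp/req-demo.txt

# Replace the "==" or "~=" specifier on each line with ">="
sed 's/[~=]=/>=/' /tmp/req-demo.txt
```

The output should read `numpy>=1.26.4` and `pandas>=2.2.0`; the character class `[~=]` is what lets the same expression unlock both `==` and `~=` pins.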
To list all the installed libraries in `site-packages`:

```shell
pip list
```

To "show" the details of a specific library:

```shell
pip show numpy
```
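Another standard PIP flag I find handy when reconciling an environment with `requirements.txt`:

```shell
# List installed packages in the same "name==version" format used by requirements.txt
pip list --format=freeze
```

`pip list --outdated` is also useful to spot packages with a newer release available, though it needs network access to query the package index.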