2A
Course taught by: Xavier Dupré (ENSAE 1999), Anne Muller (ENSAE 2012), Elodie Royant (ENSAE 2008), Antoine Thabault (ENSAE 2012), Nicolas Rousset, Antoine Ly (ENSAE 2015), Benjamin Donnot (ENSAE 2015), Gaël Varoquaux.
Contributors: Jérémie Jakubowicz (ENSAE 2002), Gilles Drigout (ENSAE 2013)
This course runs over 6 lecture/lab sessions of 4 hours each. The proposed tools are in Python. They are all open source, mostly available on GitHub and under active development. Python has recently become a more than convincing alternative for scientists, and since it is a general-purpose language, the whole data pipeline can be handled without switching languages, from processing the data sources to visualizing them.
The course is designed for both statistics-oriented and economics-oriented profiles. These icons will reappear to indicate whether a given content is aimed at one group or the other. The presentation ENSAE 2A - Données, Machine Learning et Programmation gives an overview of the topics covered.
2016 roadmap <l-feuille-de-route-2016-2A>
competitions <td2A-competition-ml>
programming project <l-projinfo2a>
Topics:
Notebooks
notebooks/td2_eco_rappels_1a
serialization, index, dataframe
Importing and exporting data to a DataFrame, manipulating it with SQL-like logic, why indexes matter, lambda functions, first plots, magic commands.
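The SQL-style logic mentioned above can be sketched even without pandas; here is a minimal pure-Python GROUP BY aggregation on toy records with hypothetical column names:

```python
from itertools import groupby
from operator import itemgetter

# Toy "DataFrame": a list of records, as pandas would load from a CSV.
rows = [
    {"city": "Paris", "sales": 10},
    {"city": "Lyon", "sales": 4},
    {"city": "Paris", "sales": 6},
]

# SQL logic: SELECT city, SUM(sales) FROM rows GROUP BY city.
# itertools.groupby requires the input sorted on the grouping key.
rows.sort(key=itemgetter("city"))
totals = {
    city: sum(r["sales"] for r in group)
    for city, group in groupby(rows, key=itemgetter("city"))
}
print(totals)  # {'Lyon': 4, 'Paris': 16}
```

With pandas the same operation is a one-liner (`df.groupby("city")["sales"].sum()`), which is exactly what the notebooks below practice.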
Notebooks
notebooks/_gs2a_dataframe
Modules
Notebooks
notebooks/_gs2a_dnumpy
Readings
Modules
Notebooks
ext2a/sql_doc notebooks/_gs2a_sql
Plan
- Present 10 plotting libraries at PyData 2016.
- Group students in pairs
- Pick a dataset
- Each group tries a different library
- Emphasize visualizing large datasets
There are many visualization libraries, split into two main families. The first produces static images (matplotlib, seaborn, networkx); the second produces interactive plots driven by Javascript (bokeh, bqplot). The most recent libraries implement both modes while aiming for ever more simplicity. On this topic, flexx is worth a look. They also explore interactive visualization of large datasets, such as datashader.
Notebook on matplotlib
notebooks/_gs2a_visu
- Classic metric plots for machine learning models
- Classic metric plots for machine learning models - correction
Notebook on Javascript
ext2a/javascript_doc
- Read
Javascript and data processing <blog-js-data>
Modules
- matplotlib
- seaborn
- bokeh
- bqplot
l-visualisation
Notebooks
td1acenoncesession12rst
td1acorrectionsession12rst
- Evolution of a population
- Evolution of a population - correction
Data formats
Coordinate systems <blog-donnees-carroyees-2016>
(and gridded data) - map formats: shapefiles, topoJSON, geoJSON
- Spherical projections and conversion
- converting coordinates to longitude / latitude
- libraries: basemap, ...
- sources: DataMaps, Find Data
Modules
(coming soon)
Modules
- TensorBoard: a project likely to gain a lot of traction. It is used to visualize intermediate results, to compare runs, and to inspect the output of a machine learning process, in particular deep neural networks. Even Keras supports it. How to use tensorboard Embedding Projector? Examples: TensorBoard: Embedding Visualization, An Encounter with Google's TensorFlow, How to plot a ROC curve with Tensorflow and scikit-learn?
Building an embedding most often means building a function that converts an integer, a graph, or a text into a fixed-dimension real vector that a machine learning model can use. This part focuses on building better features than the ones coming from the original problem.
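To make the idea concrete, here is a minimal sketch of one classic embedding, the hashing trick, which maps any text to a fixed-dimension vector (a toy stand-in for the richer methods listed below, not one of them):

```python
import hashlib

def embed(text, dim=8):
    """Hashed bag-of-words: map any text to a fixed-dimension real vector
    by hashing each token to one of `dim` coordinates."""
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

v = embed("le machine learning du machine learning")
print(len(v), sum(v))  # 8 6.0 : dimension is fixed, total mass is the token count
```

Whatever the input length, the output dimension is constant, which is the property a downstream model needs.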
(coming soon)
- PCA, Sparse PCA, Kernel PCA
- SOM
- LSH
Readings
- PCA
- Johnson–Lindenstrauss lemma, Random projection, Concentration of measure, Experiments with Random Projection
- Compressed sensing and single-pixel cameras (wikipedia : Compressed Sensing)
- Locality-sensitive hashing, LSH Forest: Self-Tuning Indexes for Similarity Search
- Manifold learning
- Kohonen maps
- Dynamic Self-Organising Map
- Fast Randomized SVD
- Neural Autoregressive Distribution Estimation
- How to Use t-SNE Effectively
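Among the readings above, the random-projection idea behind the Johnson–Lindenstrauss lemma fits in a few lines of pure Python; this is a sketch with toy dimensions, not a tuned implementation:

```python
import math
import random

random.seed(0)

def random_projection(points, k):
    """Project d-dimensional points down to k dimensions with a random
    Gaussian matrix scaled by 1/sqrt(k). The Johnson-Lindenstrauss lemma
    says pairwise distances are approximately preserved with high probability."""
    d = len(points[0])
    R = [[random.gauss(0, 1) / math.sqrt(k) for _ in range(d)] for _ in range(k)]
    return [[sum(r[j] * p[j] for j in range(d)) for r in R] for p in points]

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two points in dimension 1000, projected down to 50 dimensions.
p = [[random.gauss(0, 1) for _ in range(1000)] for _ in range(2)]
q = random_projection(p, 50)
print(dist(p[0], p[1]), dist(q[0], q[1]))  # close up to a small distortion
```

The projected distance matches the original one up to a distortion of order 1/sqrt(k), with no training at all, which is why random projections are a cheap baseline for dimensionality reduction.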
Modules
- scikit-learn
- statsmodels
- fbpca : ACP
- prince : ACM
- Parametric-t-SNE
- datasketch (LSH)
- NearPy (LSH)
Animations
(coming soon)
- Correlation between categorical variables
Readings
Transforming categorical variables and contrasts <encoding-categorie-id>
- Correlations between categorical variables
- Example of processing a categorical variable
- Exam questions on categorical variables and their
correction <tdnote2017rs>
(coming soon)
Readings
- Learning Hierarchical Similarity Metrics
- From Word Embeddings To Document Distances
- Detecting Near-Duplicates for Web Crawling
- Deep metric learning using Triplet network
notebooks/_gs2a_clustering
(coming soon)
- silhouette score
- clustering of categorical variables
Readings
- A New Algorithm and Theory for Penalized Regression-based Clustering: a variable-selection method for unsupervised clustering; see also Penalized Model-Based Clustering with Application to Variable Selection
- K-means
- Kohonen maps
- Clustering by Passing Messages Between Data Points
- Map/Reduce Affinity Propagation Clustering Algorithm
- Parallel Hierarchical Affinity Propagation with MapReduce
- Cats & Co: Categorical Time Series Coclustering
- Comparing Python Clustering Algorithms
- Fast and Provably Good Seedings for k-Means
- Clustering with Same-Cluster Queries
- The K-Modes Algorithm for Clustering
- Clustering of Categorical variables
- Classification d'un ensemble de variables qualitatives
- Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup
- Online Clustering with Experts
- Kernel K-means and Spectral Clustering
- Scalable Density-Based Clustering with Quality Guarantees using Random Projections
- Clustering Via Decision Tree Construction (Python implementation: dimitrs/CLTree)
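The k-means algorithm cited in the readings above can be sketched in pure Python; this is Lloyd's algorithm with a naive deterministic farthest-point initialization, on toy data:

```python
def sq(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10):
    """Lloyd's algorithm: assign each point to its nearest centroid, then
    move each centroid to the mean of its cluster. Initialization is a
    deterministic farthest-point heuristic to keep the sketch reproducible."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq(p, c) for c in centroids)))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: sq(p, centroids[i]))].append(p)
        # Move each centroid to its cluster mean (keep it if the cluster is empty).
        centroids = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids

# Two well-separated blobs around (0.1, 0.1) and (10.1, 10.1).
pts = [(0.1 * i, 0.1 * j) for i in range(3) for j in range(3)]
pts += [(10 + 0.1 * i, 10 + 0.1 * j) for i in range(3) for j in range(3)]
centers = sorted(kmeans(pts, 2))
print(centers)  # one centroid near (0.1, 0.1), the other near (10.1, 10.1)
```

Production code should use scikit-learn's `KMeans` (listed in the modules below), which adds smarter initialization and convergence checks.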
Modules
(coming soon)
Readings
- A Classification Framework for Anomaly Detection
- Security Analysis of Online Centroid Anomaly Detection
- Robust Random Cut Forest Based Anomaly Detection On Streams
- Network Traffic Decomposition for Anomaly Detection
- Network Volume Anomaly Detection and Identification in Large-scale Networks based on Online Time-structured Traffic Tensor Tracking
Videos
Modules
(coming soon)
Readings
- Graph Convolutional Networks
- Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering
- Deep Convolutional Networks on Graph-Structured Data
Gaël is one of the designers of scikit-learn.
- reading notes (Gaël Varoquaux)
- machine learning and scikit-learn (tutorials on scikit-learn),
- A few excerpts. By definition, nearest neighbors make no error on the training set; learning consists in forcing the model to make errors. Overfitting and regularization. L2 loss and L1 penalty. RandomizedPCA, GridSearch, LassoCV. Choosing the right estimator.
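A tiny sketch makes the first excerpt concrete: a 1-nearest-neighbour classifier makes no error on its own training set, which says nothing about generalization (toy data, hypothetical labels):

```python
def nn_predict(train, x):
    """1-nearest-neighbour classifier: return the label of the closest
    training point under squared Euclidean distance."""
    return min(train, key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], x)))[1]

train = [((0, 0), "A"), ((1, 0), "A"), ((5, 5), "B"), ((6, 5), "B")]

# On the training set itself, every point is its own nearest neighbour,
# so the training error is exactly zero.
train_errors = sum(nn_predict(train, x) != y for x, y in train)
print(train_errors)  # 0
```

Zero training error here is a property of the model, not evidence of quality; this is why the course insists on held-out evaluation and regularization.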
Notebooks
notebooks/_gs2a_statdes notebooks/_gs2a_ml_base
Readings
- A Visual Introduction to Machine Learning
- A few tricks for doing machine learning
- A Tour of Machine Learning Algorithms
- 12 Algorithms Every Data Scientist Should Know (2016/06)
- 10+2 Data Science Methods that Every Data Scientist Should Know in 2016 (2016/06)
- Complete Guide to Parameter Tuning in XGBoost (with codes in Python) (2016/08)
- XGBoost: A Scalable Tree Boosting System, Tianqi Chen, Carlos Guestrin
Modules
questions/some_ml
Notebooks
notebooks/_gs2a_ml
Readings
Working on the features or changing the model <mlfeaturesmodelrst>
Getting a machine learning project off the ground <l-debutermlprojet>
question_projet_2014
- MA 2823 Foundations of Machine Learning (Fall 2016)
- A Random Forest Guided Tour, Gérard Biau, Erwan Scornet
- ROC curve
- Random Rotation Ensembles
- A Unified Approach to Learning Task-Specific Bit Vector Representations for Fast Nearest Neighbor Search
Research
- XGBoost: A Scalable Tree Boosting System
- Classification of Imbalanced Data with a Geometric Digraph Family
- On the Influence of Momentum Acceleration on Online Learning
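The ROC curve mentioned in the readings reduces to a short computation: sweep the decision threshold over the sorted scores and record (FPR, TPR) pairs. A pure-Python sketch on toy scores:

```python
def roc_points(scores, labels):
    """Compute the (FPR, TPR) points of a ROC curve by sweeping the
    decision threshold over the scores, highest first."""
    pairs = sorted(zip(scores, labels), reverse=True)
    P = sum(labels)            # number of positives
    N = len(labels) - P        # number of negatives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in pairs:
        if y:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]   # a classifier that is right most of the time
area = auc(roc_points(scores, labels))
print(area)  # 8/9, the fraction of (positive, negative) pairs ranked correctly
```

scikit-learn provides the same computation as `sklearn.metrics.roc_curve` and `roc_auc_score`; the sketch just exposes the mechanics.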
multilabel
Multilabel
- A Ranking-based KNN Approach for Multi-Label Classification
- Classification by Selecting Plausible Formal Concepts in a Concept Lattice
- Large-scale Multi-label Learning with Missing Labels
- Multiclass-Multilabel Classification with More Classes than Examples
Digressions
Metrics
Libraries
JMLR regularly publishes articles about open source machine learning libraries.
- fastFM: A Library for Factorization Machines
Modules
- statsmodels
- xgboost
- mlxtend
- imbalanced-learn (the documentation is worth reading)
(coming soon)
Readings
- Learning to rank (software, datasets)
- Multiple-criteria decision analysis
- Data-driven Rank Breaking for Efficient Rank Aggregation
- BPR: Bayesian Personalized Ranking from Implicit Feedback (also applicable to recommender systems)
Modules
- xgboost
- scikit-learn
- lightfm
- rankpy (standby)
- The Lemur Project - ranklib
- scikit-criteria (standby)
(coming soon)
Readings
Modules
Notebooks
notebooks/_gs2a_deep
Tutorial
- Deep Learning course: lecture slides and lab notebooks
l-deep-learning-specials
- Artificial Intelligence, Revealed (1): blog post and videos explaining the main deep learning concepts
- colah's blog (2016/08), blog/course on deep learning
- Building Autoencoders in Keras
- Tutorials with CNTK
Sites
Pre-trained models
- Places CNN, Pre-release of Places365-CNNs (deep learning)
- CNTK (on GitHub)
Readings
- LightRNN: Memory and Computation-Efficient Recurrent Neural Networks
- Deep learning architecture diagrams
- Factorized Convolutional Neural Networks
- Deep Residual Learning for Image Recognition
- Deep Learning, Yoshua Bengio, Ian Goodfellow and Aaron Courville
- LeNet5
- mxnet
- Benchmarking State-of-the-Art Deep Learning Software Tools
- Wide & Deep Learning: Better Together with TensorFlow, Wide & Deep Learning for Recommender Systems
- To go deep or wide in learning?
- Three Classes of Deep Learning Architectures and Their Applications: A Tutorial Survey
- Tutorial: Learning Deep Architectures
- Deep Learning (Wikipedia)
- Fast R-CNN (see Object Detection using Fast R-CNN)
- Evaluation of Deep Learning Toolkits (2015/12)
- Understanding Deep Learning Requires Rethinking Generalization
- Training Deep Nets with Sublinear Memory Cost
- On the importance of initialization and momentum in deep learning
- TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
Deep Forest
Digits, Texts
- One weird trick for parallelizing convolutional neural networks
- ImageNet Classification with Deep Convolutional Neural Networks
- Very Deep Convolutional Networks for Large-Scale Image Recognition
- Multi-Digit Recognition Using A Space Displacement Neural Network
- Space Displacement Localization Neural Networks to locate origin points of handwritten text lines in historical documents
- Neural Network Architectures, Convolutional Neural Networks (CNNs / ConvNets)
- Transfer Learning
More theoretical
- Why Does Unsupervised Deep Learning Work? - A perspective from group theory
- Deep Learning of Representations: Looking Forward
- Why Does Unsupervised Pre-training Help Deep Learning?
Readings on deep text
- Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean,
- Distributed Representations of Words and Phrases and their Compositionality, Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, Jeff Dean,
- word2vec Parameter Learning Explained, Xin Rong,
- Tutorial on Auto-Encoders, Piotr Mirowski
Seen at conferences
- Fast R-CNN (dotAI)
- Mask R-CNN (dotAI)
- TensorFlow models (models suited for transfer learning: ResNet, Inception) (dotAI)
- Domain-Adversarial Training of Neural Networks (dotAI)
Modules
- theano
- keras
- mxnet
- caffe (installation)
- climin (back propagation algorithms)
- pytorch (Facebook)
- tensorflow (Google)
to follow
- chainer
- platoon: multi-GPU for theano
- scikit-theano
- Federated Learning: Collaborative Machine Learning without Centralized Training Data
Embedded deep learning
or reinforcement learning
(next year)
Readings
- Deep Reinforcement Learning through Policy Optimization (seen in Highlights of NIPS 2016: Adversarial learning, Meta-learning, and more)
- The Nuts and Bolts of Deep RL Research
- A Comprehensive Survey on Safe Reinforcement Learning
- RLPy: A Value-Function-Based Reinforcement Learning Framework for Education and Research
- UCL Course on RL
- Reinforcement Learning Part I Reinforcement Learning Part II
- Strategic Attentive Writer for Learning Macro-Actions
- Temporal difference learning
(next year)
Readings
- Bandit theory, part I
- Bandit theory, part II
- Kernel-based methods for bandit convex optimization, part 1
- Kernel-based methods for bandit convex optimization, part 2
- Kernel-based methods for bandit convex optimization, part 3
- Learning to Interact (John Langford)
- Batch Learning from Logged Bandit Feedback through Counterfactual Risk Minimization
- Stochastic Structured Prediction under Bandit Feedback
(next year)
Notebooks
notebooks/_gs2a_bayes
Readings
- A Bayesian Approximation Method for Online Ranking
- stan case studies
- Edward: A library for probabilistic modeling, inference, and criticism
Video
Modules
(coming soon)
Readings
(coming soon)
Readings
- La régression quantile en pratique
- Extensions of the Markov chain marginal bootstrap
- Iteratively reweighted least squares
Modules
(coming soon)
Readings
- Learning to learn by gradient descent by gradient descent
- Importance Weighting Without Importance Weights: An Efficient Algorithm for Combinatorial Semi-Bandits
- Making Tree Ensembles Interpretable
- Understanding variable importances in forests of randomized trees
- Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife
- Random Rotation Ensembles
- Wavelet decompositions of Random Forests - smoothness analysis, sparse approximation and applications
- "Why Should I Trust You?" Explaining the Predictions of Any Classifier (2016/06)
- Edward: A library for probabilistic modeling, inference, and criticism
- Strictly Proper Scoring Rules, Prediction, and Estimation
Modules
(coming soon)
Readings
Modules
(coming soon)
Readings
(coming soon)
Readings
- Learning Algorithms for Second-Price Auctions with Reserve
- Machine Learning in an Auction Environment
Notebooks
notebooks/_gs2a_timeseries
(coming soon: SETAR models for non-periodic series, predator-prey models)
Readings
- Time series analysis with pandas
- Consistent Algorithms for Clustering Time Series
- Learning Time Series Detection Models from Temporally Imprecise Labels
- Time Series Prediction With Deep Learning in Keras
- Sequence Classification with LSTM Recurrent Neural Networks in Python with Keras (voir LSTM)
- Time Series Classification and Clustering with Python
- Dynamic Time Warping
- Functional responses, functional covariates and the concurrent model
- Fast and Accurate Time Series Classification with WEASEL (text and timeseries)
- Forecasting at Scale (Facebook)
- SETAR: prediction for models that look cyclic but are not periodic (predator-prey type, chaotic); SETAR = Self-Exciting Threshold AutoRegressive
- Using predator-prey models on the Canadian lynx series, Inference for nonlinear dynamical systems
Modules
- statsmodels
- fbprophet (requires pystan)
- Rob J Hyndman software (only available in R)
- influxdb (An Open-Source Time Series Database)
Modules
- pyalgotrade
- zipline
- alphalens
- pyfolio
- empyrical
- quantlib
- prophet (not updated anymore)
- bloomberg API
- ta-lib
(coming soon)
Readings
- Learning to learn by gradient descent by gradient descent
- Matching Networks for One Shot Learning
- Efficient and Robust Automated Machine Learning
- Learning Regular Sets from Queries and Counterexamples
Modules
(coming soon)
Readings
- Privacy Preserving Data Mining, Cynthia Dwork, Frank McSherry, the concept of ϵ-differential privacy (long version: Privacy Preserving Data Mining)
- Differentially Private Empirical Risk Minimization
- Preserving Privacy of Continuous High-dimensional Data with Minimax Filters
- Differentially Private Online Learning
- A Differentially Private Stochastic Gradient Descent Algorithm for Multiparty Classification
- Privacy for Free: Posterior Sampling and Stochastic Gradient Monte Carlo
- Machine Learning Classification over Encrypted Data
- CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and Accuracy
- Compressed Sensing
Modules
- ciphermed: no longer maintained
(coming soon)
Readings
specials/nolabel
Notebooks
(coming soon)
Readings
Autoencoders - dimensionality reduction
- Why Does Unsupervised Pre-training Help Deep Learning?
- Autoencoders
- Autoencoders, Unsupervised Learning, and Deep Architectures
- Generative Models, Adversarial Autoencoders
- Tutorial on Variational Autoencoders, Denoising Autoencoders (dA)
- Generative Adversarial Networks, NIPS 2016 Tutorial: Generative Adversarial Networks
- Adversarial Autoencoders
- Adversarial Autoencoders (with Pytorch)
- Marginalizing Stacked Linear Denoising Autoencoders
- What Regularized Auto-Encoders Learn from the Data-Generating Distribution
- Compressed sensing and single-pixel cameras
- Multi-Label Prediction via Compressed Sensing
No label, weak labels
- Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
- Unsupervised Supervised Learning II: Margin-Based Classification without Labels, Unsupervised Supervised Learning II: Margin-Based Classification Without Labels (longer version)
- Large-scale Multi-label Learning with Missing Labels
- Reducing Label Complexity by Learning From Bags
- Learning from Corrupted Binary Labels via Class-Probability Estimation
- Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
- Multitask Learning without Label Correspondences
- Training Highly Multiclass Classifiers
Online training
- Online Incremental Feature Learning with Denoising Autoencoders
- Fast Kernel Classifiers with Online and Active Learning, A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
- Multi Kernel Learning with Online-Batch Optimization
Transfer learning
- Unsupervised and Transfer Learning Challenges in Machine Learning, Volume 7
- ICML2011 Unsupervised and Transfer Learning Workshop
- Transfer Learning
- Deep Learning of Representations for Unsupervised and Transfer Learning
- Unsupervised and Transfer Learning Challenge: a Deep Learning Approach
- Transfer Learning by Kernel Meta-Learning
- A Survey on Transfer Learning
- Domain-Adversarial Training of Neural Networks
- Stability and Hypothesis Transfer Learning
- Transfer Learning Decision Forests for Gesture Recognition
- Learning Transferable Features with Deep Adaptation Networks
- Asymmetric Transfer Learning with Deep Gaussian Processes
- Transfer Learning in Sequential Decision Problems: A Hierarchical Bayesian Approach
- Transfer Learning for Reinforcement Learning Domains: A Survey
- Unsupervised dimensionality reduction via gradient-based matrix factorization with two adaptive learning rates
Notebooks
notebooks/_gs2a_nlp
Readings
- Completion systems: completion is used by every website to help users type their query. Any commercial website uses it to guide users faster to the product they are looking for.
- Text Understanding from Scratch, Xiang Zhang, Yann LeCun
- Text Generation With LSTM Recurrent Neural Networks in Python with Keras
- Dual Learning for Machine Translation
- Supervised Word Mover's Distance
- Probabilistic Context-Free Grammars (PCFGs)
- A Roundup of Recent Text Analytics and Vis Work
- A Joint Model for Entity Analysis: Coreference, Typing, and Linking
- Disfluency Detection with a Semi-Markov Model and Prosodic Features
- Capturing Semantic Similarity for Entity Linking with Convolutional Neural Networks
- Neural CRF Parsing
- Less Grammar More Features
- Learning-Based Single-Document Summarization with Compression and Anaphoricity Constraints
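The completion system described at the top of these readings can be sketched with a sorted vocabulary and a binary search; this is a minimal stand-in for a real trie-based completer, on a toy vocabulary:

```python
import bisect

class Completer:
    """Minimal completion system: keep the vocabulary sorted and use
    binary search to list the entries sharing a given prefix."""

    def __init__(self, words):
        self.words = sorted(words)

    def complete(self, prefix, k=5):
        # All words with this prefix are contiguous in the sorted list,
        # starting at the insertion point of the prefix itself.
        i = bisect.bisect_left(self.words, prefix)
        out = []
        while i < len(self.words) and self.words[i].startswith(prefix) and len(out) < k:
            out.append(self.words[i])
            i += 1
        return out

c = Completer(["machine", "macro", "magic", "market", "matrix"])
print(c.complete("ma"))   # ['machine', 'macro', 'magic', 'market', 'matrix']
print(c.complete("mac"))  # ['machine', 'macro']
```

A production completer would additionally rank the candidates (by popularity or a learned score) rather than return them alphabetically.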
word2vec
- Towards a continuous modeling of natural language domains
- Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, Distributed Representations of Words and Phrases and their Compositionality, Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, Jeff Dean, word2vec Parameter Learning Explained, Xin Rong, Tutorial on Auto-Encoders, Piotr Mirowski
Word embedding
- On word embeddings - Part 1
- On word embeddings - Part 2: Approximating the Softmax
- On word embeddings - Part 3: The secret ingredients of word2vec
- From Word Embeddings To Document Distances
Summarization
- Beyond SumBasic: Task-Focused Summarization with Sentence Simplification and Lexical Expansion
- ROUGE: A Package for Automatic Evaluation of Summaries
Videos
Modules
- nltk
- gensim
- spacy
- Stanford CoreNLP, corenlpy
- python-rake: a small module for extracting keywords
- sumy: automatic text summarization
- pyrouge: computes the ROUGE metric
(coming soon)
Readings
- VGG Convolutional Neural Networks Practical
- Image-to-Image Translation with Conditional Adversarial Networks, Image to Image demo
- Image-to-Image Translation with Conditional Adversarial Networks
- How to Train a GAN? Tips and tricks to make GANs work
- Towards Principled Methods for Training Generative Adversarial Networks
- Instance Noise: A trick for stabilising GAN training
- Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network
Modules
- VIGRA
- opencv
- hed (Holistically-Nested Edge Detection)
- bob.bio
- all the modules from
l-deep-learning
- plat (Utilities for exploring generative latent spaces as described in the Sampling Generative Networks paper.)
Pre-trained models
- VGG16 model for Keras, VGG in TensorFlow, Very Deep Convolutional Networks for Large-Scale Visual Recognition
(coming soon)
Modules
Readings
- Probabilistic Linear Discriminant Analysis for Inferences About Identity, shrinkage
- Probabilistic Linear Discriminant Analysis for Acoustic Modelling
(next year)
Readings
- Basic models and questions in statistical network analysis
- Trinity: A Distributed Graph Engine on a Memory Cloud
- Dimensionality Reduction for Spectral Clustering
- Compressive Spectral Clustering
- Spectral Clustering on a Budget
- Partitioning Well-Clustered Graphs: Spectral Clustering Works!
- Bipartite Correlation Clustering: Maximizing Agreements
- Correlation Clustering and Biclustering with Locally Bounded Errors
- A Unified Framework for Model-based Clustering
- A Tensor Approach to Learning Mixed Membership Community Models
- Local Network Community Detection with Continuous Optimization of Conductance and Weighted Kernel K-Means
- Learning Communities in the Presence of Errors
- Fast unfolding of communities in large networks
Notebooks
notebooks/_gs2a_eco_scraping notebooks/_gs2a_eco_api
Resources
Modules
Notebooks
notebooks/_gs2a_eco_website
Readings
Modules
Notebooks
notebooks/_gs2a_magic_commands
Notebooks
notebooks/_gs2a_no_sql_exo notebooks/_gs2a_no_sql_twitter notebooks/_gs2a_big_in_memory
Readings
- Database properties: ACID, relational, transactional
- Best practices, indexes and foreign keys (the importance of random versus sequential access)
- Limits of relational structures (tree-shaped data, heterogeneous data)
- Non-relational databases, including NoSQL
l-td25asynthese
- Un tools d'itertour, ou l'inverse
- Benchmark of Python JSON libraries
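The points above on indexes and transactions can be tried directly with the standard library's sqlite3 module (toy table and column names):

```python
import sqlite3

# In-memory relational database; sqlite3 ships with Python.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, city TEXT, amount REAL)")
con.executemany("INSERT INTO sales (city, amount) VALUES (?, ?)",
                [("Paris", 10.0), ("Lyon", 4.0), ("Paris", 6.0)])
con.commit()  # transactional: changes become durable only when committed

# An index turns the WHERE lookup below into a seek instead of a full
# sequential scan -- irrelevant on 3 rows, decisive on millions.
con.execute("CREATE INDEX idx_city ON sales(city)")
total = con.execute("SELECT SUM(amount) FROM sales WHERE city = ?",
                    ("Paris",)).fetchone()[0]
print(total)  # 16.0
```

The same schema-and-index reasoning stops applying once the data is tree-shaped or heterogeneous, which is exactly the motivation for the NoSQL systems above.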
NoSQL databases
Modules
(coming soon)
Modules
Notebooks
notebooks/_gs2a_langages notebooks/_gs1a_D_calcul_dicho_cython
Readings
l-python_cplusplus
- sklearn-compiledtrees: builds a C++ implementation of the decision function of a decision tree trained with scikit-learn
Videos
- Making your code faster: Cython and parallel processing in the Jupyter Notebook
Modules
- cython
- ctypes
- boost_python
- pybind11
Serialization means converting any data structure into an array of bytes; it is essential for communication between two machines or two processes.
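A minimal sketch with the standard library's pickle module (any serialization format with the same round-trip contract would do):

```python
import pickle

# Any Python data structure becomes a byte string...
data = {"model": "knn", "params": [1, 2, 3], "score": 0.87}
blob = pickle.dumps(data)
assert isinstance(blob, bytes)

# ...that can travel between two processes or two machines
# and be rebuilt identically on the other side.
restored = pickle.loads(blob)
print(restored == data)  # True
```

pickle is Python-specific and must never be applied to untrusted bytes; for cross-language exchange, formats such as JSON play the same role.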
Notebooks
notebooks/_gs2a_parallelisation notebooks/_gs2a_serialisation
Modules
Readings
- Out-of-Core Dataframes in Python: Dask and OpenStreetMap (2015/12)
- Combining random forest models in scikit learn
- Better Python compressed persistence in joblib
td_2a_algo specials/nb_complet specials/algorithm_culture specials/problem_solved
Notebooks
notebooks/_gs2a_puzzle
Some of them are taken from several sites, including Google Code Jam.
Readings
- Profiling with Python
- types of complexity: brute force, greedy, dynamic programming
l-algoculture
l-expose-explication
- Logique, modèles, calculs (INF 423)
- Notation de Landau
- Edmonds' Blossom Algorithm (github), Blossom5, Fast and Simple Algorithms for Weighted Perfect Matching
- La recherche mathématique en mots et en images (CNRS)
- The Traveling Salesperson Problem
- Google Interview University: This is my multi-month study plan for going from web developer (self-taught, no CS degree) to Google software engineer.
- Cache replacement policies
- Technical books in French
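The algorithmic styles listed above (brute force, greedy, dynamic programming) can be contrasted on coin change, a toy coin system where the greedy strategy fails and dynamic programming does not:

```python
def greedy_change(coins, amount):
    """Greedy: always take the largest coin that fits. Fast, sometimes wrong."""
    count = 0
    for c in sorted(coins, reverse=True):
        count += amount // c
        amount %= c
    return count if amount == 0 else None

def dp_change(coins, amount):
    """Dynamic programming: best[i] = minimal number of coins summing to i,
    built bottom-up from the already-solved smaller subproblems."""
    best = [0] + [None] * amount
    for i in range(1, amount + 1):
        cand = [best[i - c] for c in coins if c <= i and best[i - c] is not None]
        best[i] = 1 + min(cand) if cand else None
    return best[amount]

coins = [1, 3, 4]
print(greedy_change(coins, 6), dp_change(coins, 6))  # 3 2
```

Greedy pays 6 as 4+1+1 (3 coins) while dynamic programming finds 3+3 (2 coins): the greedy choice is only optimal for some coin systems, whereas the DP recurrence is optimal for all of them at a higher cost.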
notebooks/_gs2a_streaming
Readings
- Algorithme BJKST
- Streaming Algorithms
- Data Stream Algorithms
- Optimal streaming histograms
- Density Estimation Over Data Stream
- Data Streaming Algorithms
- Confidence Decision Trees via Online and Active Learning for Streaming (BIG) Data
- Approximation and Streaming Algorithms for Histogram Construction Problems
- State-of-the-art on clustering data streams
- Parallel Computing of Kernel Density Estimates with MPI
- Density Estimation with Adaptive Sparse Grids for Large Data Sets
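One of the streaming ideas above, counting distinct elements in one pass with bounded memory, can be sketched with a K-Minimum-Values summary (a simplified cousin of the BJKST algorithm cited in the readings; hypothetical stream):

```python
import hashlib
import heapq

def kmv_distinct(stream, k=64):
    """K-Minimum-Values sketch: keep only the k smallest normalized hash
    values seen so far; the number of distinct items is estimated from how
    tightly those minima are packed near 0."""
    heap = []     # max-heap (negated values) holding the k smallest hashes
    kept = set()  # hashes currently in the heap, to skip exact duplicates
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16) / 2 ** 128
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Replace the current k-th smallest hash by this smaller one.
            kept.discard(-heapq.heappushpop(heap, -h))
            kept.add(h)
    if len(heap) < k:
        return len(heap)            # fewer distinct items than k: exact
    return int((k - 1) / -heap[0])  # -heap[0] is the k-th smallest hash

# One pass over 10000 values containing only 1000 distinct items.
est = kmv_distinct(i % 1000 for i in range(10000))
print(est)  # close to 1000, using one small heap of memory
```

Memory is O(k) whatever the stream length, and the relative error shrinks like 1/sqrt(k), which is the trade-off all the streaming algorithms above exploit.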
Modules
Contrary to popular belief, data scientists are more predictable than data.
machine learning
l-debutermlprojet
- Sous le capot de la boîte noire
- Quick samples on machine learning
- Cheat Sheets
- Large volumes and sqlite3
- C'est quoi déjà le True False Positive ?
what else?
- Gerrymandering (tampering with electoral maps)
- Learning synonyms
- Review of a Kaggle competition
installation
- Anaconda + conda update --all + pip install jyquickhelper
- XGBoost on Windows
Books on machine learning
- The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani, Jerome Friedman
- Python for Data Analysis, Wes McKinney
- Building Machine Learning Systems with Python, Willi Richert, Luis Pedro Coelho
- Learning scikit-learn: Machine Learning in Python, Raúl Garreta, Guillermo Moncecchi
- Modeling Creativity: Case Studies in Python, Tom De Smedt
- Critical Mass: How One Thing Leads to Another, Philip Ball
- Bugra Akyildiz
- Deep Learning, Yoshua Bengio, Ian Goodfellow and Aaron Courville
- Artificial Intelligence: A Modern Approach, Stuart Russell, Peter Norvig (2016/08)
- Speech and Language Processing, Daniel Jurafsky and James H. Martin (2016/08), see also Draft chapters in progress
Books on algorithms
- Introduction to Algorithms, Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein
- The Algorithm Design Manual, Steven S. Skiena
- Competitive Programming, Steven Halim
Books on programming
Links on programming
- Python Scientific Lecture Notes
- Introduction to matplotlib
- Introduction to Data Processing with Python
- A few book ideas: Python for Data Scientists
- Ultimate guide for Data Exploration in Python using NumPy, Matplotlib and Pandas
- Don't use Hadoop - your data isn't that big
- Prédire les épidémies avec Wikipedia, Le Monde
- FastML (a machine learning blog)
- Mathematical optimization: finding minima of functions
- you can take the derivative of a regular expression?! (2016/06)
- How to trick a neural network into thinking a panda is a vulture (2016/06)
- Matrix Factorization: A Simple Tutorial and Implementation in Python (2016/06)
- Top-down learning path: Machine Learning for Software Engineers
Tutorials
- PyData Seattle 2015 Scikit-learn Tutorial (2015/12)
- Pythonic Perambulations (2015/12)
- Python Scripts posted on Kaggle (2016/02)
- Pandas cookbook (2016/06)
- Machine Learning & Deep Learning Tutorials (2016/06): link to a fairly long list of tutorials, which also includes cheat sheets such as Probability Cheatsheet
MOOC
- Machine Learning by Andrew Y. Ng (chapters X and XI of week 6 cover building a machine learning system).
- Coursera Machine Learning
- Coursera Machine Algorithm
- CSE373 - Analysis of Algorithms - 2007 SBU
- CS109 Data Science (Harvard) (the list of available videos is at the bottom)
Other courses, notebooks
- CS109 Data Science (Harvard) - TD - Talks
- Notes and assignments for Stanford CS class CS231n Convolutional Neural Networks for Visual Recognition
- Advanced Statistical Computing, Chris Fonnesbeck (Vanderbilt University)
- CS 188: Artificial Intelligence (Berkeley)
- IAPR: Teaching materials for machine learning
- machine learning and music: Audio Content Analysis, teachings
- ogrisel's notebook (2016/04)
- L'apprentissage profond, Yann LeCun au Collège de France (2016/06)
- MA 2823 Foundations of Machine Learning (Fall 2016) (2016/10)
Papers by well-known authors
- Latent Dirichlet Allocation, David M. Blei, Andrew Y. Ng, Michael I. Jordan
- Analysis of a Random Forests Model, Gerard Biau
- Adaptivity of Averaged Stochastic Gradient Descent to Local Strong Convexity for Logistic Regression, Francis Bach
- Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising, Léon Bottou, Jonas Peters et al.
- Tutorial on Practical Prediction Theory for Classification, John Langford
- Sparse Online Learning via Truncated Gradient, John Langford, Lihong Li, Tong Zhang
- Low-dimensional Embeddings for Interpretable Anchor-based Topic Inference, Moontae Lee, David Mimno
- ABC model choice via random forests, Pierre Pudlo, Jean-Michel Marin, Arnaud Estoup, Jean-Marie Cornuet, Mathieu Gautier, Christian P. Robert
- Mondrian Forests: Efficient Online Random Forests, Balaji Lakshminarayanan, Daniel M. Roy, Yee Whye Teh
- Stochastic Gradient Tricks
- SiGMa: Simple Greedy Matching for Aligning Large Knowledge Bases, Simon Lacoste-Julien, Konstantina Palla, Alex Davies, Gjergji Kasneci, Thore Graepel, Zoubin Ghahramani
- Learning from Partial Labels, Timothee Cour, Benjamin Sapp, Ben Taskar
- Word Alignment via Quadratic Assignment, Simon Lacoste-Julien, Ben Taskar, Dan Klein, Michael I. Jordan
- Contextual Bandit Learning with Predictable Rewards, Alekh Agarwal, Miroslav Dudík, Satyen Kale, John Langford, Robert E. Schapire
- Learning from Logged Implicit Exploration Data, Alex Strehl, John Langford, Lihong LiSham, M. Kakade
- The Metropolis-Hastings algorithm, Christian P. Robert
- From RankNet to LambdaRank to LambdaMART: An Overview, Christopher J.C. Burges
Coding competitions
- Google Hash Code, held every year in two rounds; the second round takes place at Google's Paris office.
- Google Code Jam
- TopCoder
- UVa Online Judge
- The eight queens problem
- Project Euler
To conclude, Choosing the right estimator:
Python libraries
- Simple/limited/incomplete benchmark for scalability, speed and accuracy of machine learning libraries for classification
- Python extensions to do machine learning
- Related Projects (of machine learning) (2016/03)
Machine learning libraries
- Awesome Machine Learning
- CNTK (2016/04)
- Keras
- pytorch
- scikit-learn
- TensorFlow
- theano
- Vowpal Wabbit
- xgboost
Videos