Python for a Data Scientist / Economist

2A

Course taught by: Xavier Dupré (ENSAE 1999)¹, Anne Muller (ENSAE 2012)², Elodie Royant (ENSAE 2008)³, Antoine Thabault (ENSAE 2012)⁴, Nicolas Rousset⁵, Antoine Ly (ENSAE 2015), Benjamin Donnot (ENSAE 2015), Gaël Varoquaux⁶.

Contributors: Jérémie Jakubowicz (ENSAE 2002)⁷, Gilles Drigout (ENSAE 2013)⁸

This course runs over six 4-hour lecture/lab sessions. All the tools presented are in Python. They are all open source, most of them hosted on GitHub and under active development. Python has recently become a very credible alternative for scientists, and because it is a general-purpose language, every processing step applied to the data, from handling the data sources to visualizing the results, can be done without switching languages.

The course is aimed at students with either a statistics or an economics background. These images recur throughout to indicate which of the two audiences a given piece of content is meant for. The presentation ENSAE 2A - Données, Machine Learning et Programmation gives an overview of the topics covered.

2016 roadmap <l-feuille-de-route-2016-2A>
competitions <td2A-competition-ml>
computer science project <l-projinfo2a>

Topics:

Programming refresher

Notebooks

notebooks/td2_eco_rappels_1a

serialization, index, dataframe

Matrices and DataFrames - numpy pandas SQL

Importing and exporting data with a DataFrame, SQL-style manipulation, why indexes matter, lambda functions, first plots, magic commands.

DataFrame

Notebooks

notebooks/_gs2a_dataframe

Modules

pandas

Array, Matrix

Notebooks

notebooks/_gs2a_dnumpy

Readings

From Python to Numpy

Modules

numpy
scipy

SQL

Notebooks

ext2a/sql_doc notebooks/_gs2a_sql

Visualisation

Charts

Plan

Present 10 plotting libraries at PyData 2016.
Pair the students up
Pick a dataset
Each pair tries a different library
Emphasize the visualization of large datasets

There are many visualization libraries, which fall into two broad families. The first produces images (matplotlib, seaborn, networkx); the second produces animated charts driven by Javascript (bokeh, bqplot). The most recent libraries implement both modes while aiming for ever more simplicity. flexx is worth a look in this respect. They also explore animated visualization of large datasets, as datashader does.

Notebook on matplotlib

notebooks/_gs2a_visu

Classic metric plots for machine learning models
Classic metric plots for machine learning models - solution

Notebook on Javascript

ext2a/javascript_doc

Read Javascript and data processing <blog-js-data>

Modules

matplotlib
seaborn
bokeh
bqplot
l-visualisation

Maps

Notebooks

td1acenoncesession12rst
td1acorrectionsession12rst
Evolution of a population
Evolution of a population - solution

Data formats

Coordinate systems <blog-donnees-carroyees-2016> (and gridded data)
map formats: shapefiles, topoJSON, geoJSON
Spherical projections and conversion
converting coordinates to longitude / latitude
libraries: basemap, ...
sources: DataMaps, Find Data

Modules

basemap
cartopy
pyshp
shapely
pyproj
geopy

Visualizing to understand

(coming soon)

Modules

TensorBoard: a project that is likely to gain a lot of momentum. It is used to visualize intermediate results, to compare them, and to inspect the outcome of a machine learning process, deep neural networks in particular. Even Keras is adopting it. How to use tensorboard Embedding Projector ? Examples: TensorBoard: Embedding Visualization, An Encounter with Google’s TensorFlow, How to plot a ROC curve with Tensorflow and scikit-learn?

Data transformations, Embedding

Building an embedding most often means building a function that converts an integer, a graph, or a text into a real-valued vector of fixed dimension that a machine learning model can use. This part is about building better features than those provided by the original problem.
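As an illustration of "a function that converts a text into a real vector of fixed dimension", here is a minimal sketch based on the hashing trick. It is only one possible construction, not the method used in the course notebooks; the function name `hashed_embedding`, the dimension 16 and the whitespace tokenization are assumptions made for the example.

```python
import hashlib
import math


def hashed_embedding(text, dim=16):
    """Map a text of any length to a unit-norm vector of size `dim`.

    Hypothetical helper: each token is hashed into one of `dim` buckets
    and the bucket counts are normalized (hashing trick).
    """
    vector = [0.0] * dim
    for token in text.lower().split():
        # hashlib gives the same bucket on every run, unlike the built-in hash().
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vector[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    # An empty text keeps the all-zero vector.
    return [v / norm for v in vector] if norm > 0 else vector


vec_short = hashed_embedding("python for data science")
vec_long = hashed_embedding("a much longer sentence about machine learning and data")
print(len(vec_short), len(vec_long))
```

Both texts end up in the same 16-dimensional space whatever their length, which is what a downstream model needs. Learned embeddings (word2vec, autoencoders) replace the fixed hash with a trained mapping.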

Projections, dimensionality reduction

(coming soon)

PCA, Sparse PCA, Kernel PCA
SOM
LSH

Readings

PCA
Johnson–Lindenstrauss lemma, Random projection, Concentration of measure, Experiments with Random Projection
Compressed sensing and single-pixel cameras (wikipedia : Compressed Sensing)
Locality-sensitive hashing, LSH Forest: Self-Tuning Indexes for Similarity Search
Manifold learning
Kohonen maps
Dynamic Self-Organising Map
Fast Randomized SVD
Neural Autoregressive Distribution Estimation
How to Use t-SNE Effectively

Modules

scikit-learn
statsmodels
fbpca: PCA
prince: MCA
Parametric-t-SNE
datasketch (LSH)
NearPy (LSH)

Animations

How to Use t-SNE Effectively

Categorical variables

(coming soon)

Correlation between categorical variables

Readings

Transforming categorical variables and contrasts <encoding-categorie-id>
Correlations between categorical variables
Example of processing a categorical variable
Exam paper on categorical variables and its solution <tdnote2017rs>

Distances

(coming soon)

Readings

Learning Hierarchical Similarity Metrics
From Word Embeddings To Document Distances
Detecting Near-Duplicates for Web Crawling
Deep metric learning using Triplet network

Clustering

notebooks/_gs2a_clustering

(coming soon)

silhouette score
clustering of categorical variables
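The silhouette score listed above can be sketched in a few lines of plain Python. This is an illustrative O(n²) version with Euclidean distance, assuming points are tuples of floats and that not all points coincide; the function name and the toy data are hypothetical. scikit-learn provides a production implementation in `sklearn.metrics.silhouette_score`.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient, Euclidean distance, O(n^2) time."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    def mean_dist(point, others):
        return sum(math.dist(point, o) for o in others) / len(others)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point itself contributes 0 to the sum).
        a = sum(math.dist(point, o) for o in own) / (len(own) - 1)
        # b: mean distance to the closest other cluster.
        b = min(mean_dist(point, members)
                for other, members in clusters.items() if other != label)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
print(good, bad)
```

A score close to 1 means points sit much closer to their own cluster than to any other; a negative score, as with the second labelling, suggests points were assigned to the wrong cluster.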

Readings

A New Algorithm and Theory for Penalized Regression-based Clustering: a variable selection method for unsupervised clustering methods, see also Penalized Model-Based Clustering with Application to Variable Selection
K-means
Kohonen maps
Clustering by Passing Messages Between Data Points
Map/Reduce Affinity Propagation Clustering Algorithm
Parallel Hierarchical Affinity Propagation with MapReduce
Cats & Co: Categorical Time Series Coclustering
Comparing Python Clustering Algorithms
Fast and Provably Good Seedings for k-Means
Clustering with Same-Cluster Queries
The K-Modes Algorithm for Clustering
Clustering of Categorical variables
Classification d'un ensemble de variables qualitatives
Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup
Online Clustering with Experts
Kernel K-means and Spectral Clustering
Scalable Density-Based Clustering with Quality Guarantees using Random Projections
Clustering Via Decision Tree Construction (Python implementation: dimitrs/CLTree)

Modules

scikit-learn
hdbscan
pyclustering
pycluster

Anomaly detection

(coming soon)

Readings

A Classification Framework for Anomaly Detection
Security Analysis of Online Centroid Anomaly Detection
Robust Random Cut Forest Based Anomaly Detection On Streams
Network Traffic Decomposition for Anomaly Detection
Network Volume Anomaly Detection and Identification in Large-scale Networks based on Online Time-structured Traffic Tensor Tracking

Videos

Anomaly Detection vs. Supervised Learning

Modules

scikit-learn
pyculiarity
lsanomaly

Graphs and embedding

(coming soon)

Readings

Graph Convolutional Networks
Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering
Deep Convolutional Networks on Graph-Structured Data

Machine Learning - Formalization

Machine learning, course by Gaël Varoquaux

Gaël is one of the designers of scikit-learn.

lecture notes (Gaël Varoquaux)
machine learning and scikit-learn (scikit-learn tutorials)
A few excerpts. By construction, nearest neighbors make no error on the training set; learning consists in forcing the model to make errors. Overfitting and regularization. L2 error and L1 penalty. RandomizedPCA, GridSearch, LassoCV. Choosing the right estimator.
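The remark about nearest neighbors can be checked with a small experiment. The sketch below is a hand-written 1-nearest-neighbor classifier on hypothetical noisy toy data (label given by the sign of the first coordinate, flipped 20% of the time); the data generator and function names are assumptions made for the illustration, not taken from the course.

```python
import random


def predict_1nn(train_x, train_y, point):
    """Return the label of the training point closest to `point`."""
    distances = [sum((a - b) ** 2 for a, b in zip(x, point)) for x in train_x]
    return train_y[distances.index(min(distances))]


def error_rate(train_x, train_y, xs, ys):
    """Fraction of points in (xs, ys) misclassified by the 1-NN rule."""
    wrong = sum(predict_1nn(train_x, train_y, x) != y for x, y in zip(xs, ys))
    return wrong / len(xs)


rng = random.Random(0)


def sample(n):
    # Hypothetical toy data: label = sign of x0, flipped with probability 0.2.
    xs = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    ys = [int(x[0] > 0) ^ int(rng.random() < 0.2) for x in xs]
    return xs, ys


train_x, train_y = sample(200)
test_x, test_y = sample(200)
train_error = error_rate(train_x, train_y, train_x, train_y)
test_error = error_rate(train_x, train_y, test_x, test_y)
print(train_error, test_error)
```

Each training point is its own nearest neighbor, so the training error is zero even though the labels are noisy. The error on fresh data is not, which is the overfitting the excerpt alludes to; using more neighbors (k > 1) trades some training error for better generalization.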

Notebooks

notebooks/_gs2a_statdes notebooks/_gs2a_ml_base

Readings

A Visual Introduction to Machine Learning
A few tips for doing machine learning
A Tour of Machine Learning Algorithms
12 Algorithms Every Data Scientist Should Know (2016/06)
10+2 Data Science Methods that Every Data Scientist Should Know in 2016 (2016/06)
Complete Guide to Parameter Tuning in XGBoost (with codes in Python) (2016/08)
XGBoost: A Scalable Tree Boosting System, Tianqi Chen, Carlos Guestrin

Modules

scikit-learn

Machine learning in practice, data issues

questions/some_ml

Notebooks

notebooks/_gs2a_ml

Readings

Work on the features or change the model <mlfeaturesmodelrst>
Getting a machine learning project off to a good start <l-debutermlprojet>
question_projet_2014
MA 2823 Foundations of Machine Learning (Fall 2016)
A Random Forest Guided Tour, Gérard Biau, Erwan Scornet
ROC curve
Random Rotation Ensembles
A Unified Approach to Learning Task-Specific Bit Vector Representations for Fast Nearest Neighbor Search

Research

XGBoost: A Scalable Tree Boosting System
Classification of Imbalanced Data with a Geometric Digraph Family
On the Influence of Momentum Acceleration on Online Learning

Multilabel

A Ranking-based KNN Approach for Multi-Label Classification
Classification by Selecting Plausible Formal Concepts in a Concept Lattice
Large-scale Multi-label Learning with Missing Labels
Multiclass-Multilabel Classification with More Classes than Examples

Digressions

A Network That Learns Strassen Multiplication
Learning Theory for Distribution Regression

Metrics

Optimization of AMS using Weighted AUC optimized models

Libraries

JMLR regularly publishes articles about open source machine learning libraries.

fastFM: A Library for Factorization Machines

Modules

statsmodels
xgboost
mlxtend
imbalanced-learn (the documentation is worth reading)

Ranking

(coming soon)

Readings

Learning to rank (software, datasets)
Multiple-criteria decision analysis
Data-driven Rank Breaking for Efficient Rank Aggregation
BPR: Bayesian Personalized Ranking from Implicit Feedback (also applicable to recommender systems)

Modules

xgboost
scikit-learn
lightfm
rankpy (standby)
The Lemur Project - ranklib
scikit-criteria (standby)

Recommender systems

(coming soon)

Readings

Recommendations in Keras using triplet loss

Modules

scikit-learn

Deep Learning

Notebooks

notebooks/_gs2a_deep

Tutorials

Deep Learning course: lecture slides and lab notebooks
l-deep-learning-specials.
Artificial Intelligence, Revealed (1): blog post and videos explaining the main concepts of deep learning
colah's blog (2016/08): blog/course on deep learning
Building Autoencoders in Keras
Tutorials with CNTK

Sites

Tinker With a Neural Network Right Here in Your Browser
ConvNetJS
Databricks / Deep Learning

Pre-trained models

Places CNN, Pre-release of Places365-CNNs (deep learning)
CNTK (on GitHub)

Readings

LightRNN: Memory and Computation-Efficient Recurrent Neural Networks
Deep learning architecture diagrams
Factorized Convolutional Neural Networks
Deep Residual Learning for Image Recognition
Deep Learning, Yoshua Bengio, Ian Goodfellow and Aaron Courville
LeNet5
mxnet
Benchmarking State-of-the-Art Deep Learning Software Tools
Wide & Deep Learning: Better Together with TensorFlow, Wide & Deep Learning for Recommender Systems
To go deep or wide in learning?
Three Classes of Deep Learning Architectures and Their Applications: A Tutorial Survey
Tutorial: Learning Deep Architectures
Deep Learning (Wikipedia)
Fast R-CNN (see Object Detection using Fast R CNN)
Evaluation of Deep Learning Toolkits (2015/12)
Understanding Deep Learning Requires Rethinking Generalization
Training Deep Nets with Sublinear Memory Cost
On the importance of initialization and momentum in deep learning
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

Deep Forest

Deep Forest: Towards An Alternative to Deep Neural Networks

Digits, text

One weird trick for parallelizing convolutional neural networks
ImageNet Classification with Deep Convolutional Neural Networks
Very Deep Convolutional Networks for Large-Scale Image Recognition
Multi-Digit Recognition Using A Space Displacement Neural Network
Space Displacement Localization Neural Networks to locate origin points of handwritten text lines in historical documents
Neural Network Architectures, Convolutional Neural Networks (CNNs / ConvNets)
Transfer Learning

More theoretical

Why Does Unsupervized Deep Learning Work? - A perspective from group theory
Deep Learning of Representations: Looking Forward
Why Does Unsupervised Pre-training Help Deep Learning?

Readings: deep text

Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean
Distributed Representations of Words and Phrases and their Compositionality, Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, Jeff Dean
word2vec Parameter Learning Explained, Xin Rong
Tutorial on Auto-Encoders, Piotr Mirowski

Seen at conferences

Fast R-CNN (dotAI)
Mask R-CNN (dotAI)
TensorFlow models (models suited to transfer learning: ResNet, Inception) (dotAI)
Domain-Adversarial Training of Neural Networks (dotAI)

Modules

theano
keras
mxnet
caffe (installation)
climin (backpropagation algorithm)
pytorch (Facebook)
tensorflow (Google)

to watch

chainer
platoon: multi-GPU for theano
scikit-theano
Federated Learning: Collaborative Machine Learning without Centralized Training Data

Embedded deep learning

TensorFlow on Android
TensorFlow on Raspberry Pi

Reinforcement Learning

(apprentissage par renforcement in French)

(next year)

Readings

Deep Reinforcement Learning through Policy Optimization (seen in Highlights of NIPS 2016: Adversarial learning, Meta-learning, and more)
The Nuts and Bolts of Deep RL Research
A Comprehensive Survey on Safe Reinforcement Learning
RLPy: A Value-Function-Based Reinforcement Learning Framework for Education and Research
UCL Course on RL
Reinforcement Learning Part I, Reinforcement Learning Part II
Strategic Attentive Writer for Learning Macro-Actions
Temporal difference learning

Bandits

(next year)

Readings

Bandit theory, part I
Bandit theory, part II
Kernel-based methods for bandit convex optimization, part 1
Kernel-based methods for bandit convex optimization, part 2
Kernel-based methods for bandit convex optimization, part 3
Learning to Interact (John Langford)
Batch Learning from Logged Bandit Feedback through Counterfactual Risk Minimization
Stochastic Structured Prediction under Bandit Feedback

Bayesian models

(next year)

Notebooks

notebooks/_gs2a_bayes

Readings

A Bayesian Approximation Method for Online Ranking
stan case studies
Edward: A library for probabilistic modeling, inference, and criticism

Videos

Variational Inference in Python
Bayesian Network Modeling using R and Python

Modules

edward
PyMC3
bayespy
kabuki

Factorization Machines

(coming soon)

Readings

Factorization Machines with libFM (2016/09)
Stochastic Subsampling for Factorizing Huge Matrices

Advanced Machine Learning

Quantile regression

(coming soon)

Readings

La régression quantile en pratique
Extensions of the Markov chain marginal bootstrap
Iteratively reweighted least squares

Modules

statsmodels

Model interpretability

(coming soon)

Readings

Learning to learn by gradient descent by gradient descent
Importance Weighting Without Importance Weights: An Efficient Algorithm for Combinatorial Semi-Bandits
Making Tree Ensembles Interpretable
Understanding variable importances in forests of randomized trees
Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife
Random Rotation Ensembles
Wavelet decompositions of Random Forests - smoothness analysis, sparse approximation and applications
"Why Should I Trust You?" Explaining the Predictions of Any Classifier (2016/06)
Edward: A library for probabilistic modeling, inference, and criticism
Strictly Proper Scoring Rules, Prediction, and Estimation

Modules

edward

Hyperparameter optimization

(coming soon)

Readings

Algorithms for Hyper-Parameter Optimization

Modules

scikit-learn
hyperopt

Online training

(coming soon)

Readings

Fast Rates in Statistical and Online Learning

Models with time dependencies

(coming soon)

Readings

Learning Algorithms for Second-Price Auctions with Reserve
Machine Learning in an Auction Environment

Time series

Notebooks

notebooks/_gs2a_timeseries

(coming soon: SETAR models for non-periodic series, predator-prey models)

Readings

Time series analysis with pandas
Consistent Algorithms for Clustering Time Series
Learning Time Series Detection Models from Temporally Imprecise Labels
Time Series Prediction With Deep Learning in Keras
Sequence Classification with LSTM Recurrent Neural Networks in Python with Keras (see LSTM)
Time Series Classification and Clustering with Python
Dynamic Time Warping
Functional responses, functional covariates and the concurrent model
Fast and Accurate Time Series Classification with WEASEL (text and timeseries)
Forecasting at Scale (Facebook)
SETAR: forecasting with models that look cyclic but are not periodic (predator-prey type, chaotic), SETAR = Self-Exciting Threshold AutoRegressive
Using predator-prey models on the Canadian lynx series, Inference for nonlinear dynamical systems
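To make the SETAR acronym concrete, here is a minimal simulation sketch of a two-regime, order-1 model: the autoregressive coefficients switch depending on whether the previous value of the series itself is below or above a threshold. All numerical values (threshold, coefficients, noise level) and the function name are arbitrary assumptions chosen for the illustration.

```python
import random


def simulate_setar(n, threshold=0.0, low=(0.5, 0.6), high=(-0.5, -0.4),
                   sigma=0.3, seed=0):
    """Simulate a two-regime SETAR series of order 1.

    `low` and `high` are (intercept, slope) pairs. The regime is picked by
    comparing the previous value of the series to `threshold`, which is
    what "self-exciting" refers to.
    """
    rng = random.Random(seed)
    series, regimes = [0.0], []
    for _ in range(n):
        previous = series[-1]
        regime = "low" if previous <= threshold else "high"
        intercept, slope = low if regime == "low" else high
        series.append(intercept + slope * previous + rng.gauss(0.0, sigma))
        regimes.append(regime)
    # Drop the arbitrary starting value.
    return series[1:], regimes


series, regimes = simulate_setar(500)
switches = sum(a != b for a, b in zip(regimes, regimes[1:]))
print(len(series), switches)
```

The series keeps moving between the two regimes without a fixed period, which is the "cyclic but not periodic" behaviour mentioned above. Estimating the threshold and coefficients from data is a separate problem not covered by this sketch.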

Modules

statsmodels
fbprophet (requires pystan)
Rob J Hyndman software (available in R only)
influxdb (An Open-Source Time Series Database)

Finance

Modules

pyalgotrade
zipline
alphalens
pyfolio
empyrical
quantlib
prophet (not updated anymore)
bloomberg API
ta-lib

Auto-Learning

(coming soon)

Angluin Algorithm

Readings

Learning to learn by gradient descent by gradient descent
Matching Networks for One Shot Learning
Efficient and Robust Automated Machine Learning
Learning Regular Sets from Queries and Counterexamples

Modules

REP
TPOT
auto-sklearn

Machine learning on encrypted data

(coming soon)

Readings

Privacy Preserving Data Mining, Cynthia Dwork, Frank McSherry, the concept of ϵ-differential privacy (longer version: Privacy Preserving Data Mining)
Differentially Private Empirical Risk Minimization
Preserving Privacy of Continuous High-dimensional Data with Minimax Filters
Differentially Private Online Learning
A Differentially Private Stochastic Gradient Descent Algorithm for Multiparty Classification
Privacy for Free: Posterior Sampling and Stochastic Gradient Monte Carlo
Machine Learning Classification over Encrypted Data
CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and Accuracy
Compressed Sensing

Modules

ciphermed: no longer maintained

Predicting a distribution

(coming soon)

Readings

Learning with a Wasserstein Loss

Learning without labels

specials/nolabel

Notebooks

(coming soon)

Readings

Autoencoders - dimensionality reduction

Why Does Unsupervised Pre-training Help Deep Learning?
Autoencoders
Autoencoders, Unsupervised Learning, and Deep Architectures
Generative Models, Adversarial Autoencoders
Tutorial on Variational Autoencoders, Denoising Autoencoders (dA)
Generative Adversarial Networks, NIPS 2016 Tutorial: Generative Adversarial Networks
Adversarial Autoencoders
Adversarial Autoencoders (with Pytorch)
Marginalizing Stacked Linear Denoising Autoencoders
What Regularized Auto-Encoders Learn from the Data-Generating Distribution
Compressed sensing and single-pixel cameras
Multi-Label Prediction via Compressed Sensing

No label, weak labels

Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
Unsupervised Supervised Learning II: Margin-Based Classification without Labels, Unsupervised Supervised Learning II: Margin-Based Classification Without Labels (longer version)
Large-scale Multi-label Learning with Missing Labels
Reducing Label Complexity by Learning From Bags
Learning from Corrupted Binary Labels via Class-Probability Estimation
Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
Multitask Learning without Label Correspondences
Training Highly Multiclass Classifiers

Online training

Online Incremental Feature Learning with Denoising Autoencoders
Fast Kernel Classifiers with Online and Active Learning, A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
Multi Kernel Learning with Online-Batch Optimization

Transfer learning

Unsupervised and Transfer Learning Challenges in Machine Learning, Volume 7
ICML2011 Unsupervised and Transfer Learning Workshop
Transfer Learning
Deep Learning of Representations for Unsupervised and Transfer Learning
Unsupervised and Transfer Learning Challenge: a Deep Learning Approach
Transfer Learning by Kernel Meta-Learning
A Survey on Transfer Learning
Domain-Adversarial Training of Neural Networks
Stability and Hypothesis Transfer Learning
Transfer Learning Decision Forests for Gesture Recognition
Learning Transferable Features with Deep Adaptation Networks
Asymmetric Transfer Learning with Deep Gaussian Processes
Transfer Learning in Sequential Decision Problems: A Hierarchical Bayesian Approach
Transfer Learning for Reinforcement Learning Domains: A Survey
Unsupervised dimensionality reduction via gradient-based matrix factorization with two adaptive learning rates

NLP - Images - Networks

Natural language processing

Notebooks

notebooks/_gs2a_nlp

Readings

Completion systems: completion is used by every website to help users type their search queries. Any e-commerce site uses it to guide users more quickly to the product they are looking for.
Text Understanding from Scratch, Xiang Zhang, Yann LeCun
Text Generation With LSTM Recurrent Neural Networks in Python with Keras
Dual Learning for Machine Translation
Supervised Word Mover's Distance
Probabilistic Context-Free Grammars (PCFGs)
A Roundup of Recent Text Analytics and Vis Work
A Joint Model for Entity Analysis: Coreference, Typing, and Linking
Disfluency Detection with a Semi-Markov Model and Prosodic Features
Capturing Semantic Similarity for Entity Linking with Convolutional Neural Networks
Neural CRF Parsing
Less Grammar More Features
Learning-Based Single-Document Summarization with Compression and Anaphoricity Constraints
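The completion system mentioned at the top of this list can be sketched with a sorted list and a binary search. This is a deliberately simple illustration: real systems rank suggestions by popularity rather than alphabetically, and the query list and the function name `complete` are made up for the example.

```python
import bisect


def complete(sorted_queries, prefix, limit=5):
    """Return up to `limit` queries starting with `prefix`.

    `sorted_queries` must be sorted: all matches are then contiguous and
    the first one is located by binary search.
    """
    start = bisect.bisect_left(sorted_queries, prefix)
    results = []
    for query in sorted_queries[start:start + limit]:
        if not query.startswith(prefix):
            break
        results.append(query)
    return results


queries = sorted(["pandas", "panda", "python", "pytorch", "numpy", "pyproj"])
print(complete(queries, "py"))   # ['pyproj', 'python', 'pytorch']
print(complete(queries, "pan"))  # ['panda', 'pandas']
```

The lookup costs O(log n) plus the number of suggestions returned, which is why sorted arrays and tries are common building blocks for this task.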

word2vec

Towards a continuous modeling of natural language domains
Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean
Distributed Representations of Words and Phrases and their Compositionality, Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, Jeff Dean
word2vec Parameter Learning Explained, Xin Rong
Tutorial on Auto-Encoders, Piotr Mirowski

Word embedding

On word embeddings - Part 1
On word embeddings - Part 2: Approximating the Softmax
On word embeddings - Part 3: The secret ingredients of word2vec
From Word Embeddings To Document Distances

Summarization

Beyond SumBasic: Task-Focused Summarization with Sentence Simplification and Lexical Expansion
ROUGE: A Package for Automatic Evaluation of Summaries

Videos

Modern NLP in Python

Modules

nltk
gensim
spacy
Stanford CoreNLP, corenlpy
python-rake: small module for extracting keywords
sumy: automatic summarization of a text
pyrouge: computes the ROUGE metric

Images

(coming soon)

Readings

VGG Convolutional Neural Networks Practical
Image-to-Image Translation with Conditional Adversarial Networks, Image to Image demo
Image-to-Image Translation with Conditional Adversarial Networks
How to Train a GAN? Tips and tricks to make GANs work
Towards Principled Methods for Training Generative Adversarial Networks
Instance Noise: A trick for stabilising GAN training
Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network

Modules

VIGRA
opencv
hed (Holistically-Nested Edge Detection)
bob.bio
all the modules listed in l-deep-learning
plat (Utilities for exploring generative latent spaces as described in the Sampling Generative Networks paper.)

Pre-trained models

VGG16 model for Keras, VGG in TensorFlow, Very Deep Convolutional Networks for Large-Scale Visual Recognition

Faces, speech

(coming soon)

Modules

bob.bio
kaldi (speech recognition)

Readings

Probabilistic Linear Discriminant Analysis for Inferences About Identity, shrinkage
Probabilistic Linear Discriminant Analysis for Acoustic Modelling

Graphs and networks

(next year)

Readings

Basic models and questions in statistical network analysis
Trinity: A Distributed Graph Engine on a Memory Cloud
Dimensionality Reduction for Spectral Clustering
Compressive Spectral Clustering
Spectral Clustering on a Budget
Partitioning Well-Clustered Graphs: Spectral Clustering Works!
Bipartite Correlation Clustering: Maximizing Agreements
Correlation Clustering and Biclustering with Locally Bounded Errors
A Unified Framework for Model-based Clustering
A Tensor Approach to Learning Mixed Membership Community Models
Local Network Community Detection with Continuous Optimization of Conductance and Weighted Kernel K-Means
Learning Communities in the Presence of Errors
Fast unfolding of communities in large networks

Programming techniques and algorithms

Web scraping and APIs

Notebooks

notebooks/_gs2a_eco_scraping notebooks/_gs2a_eco_api

Resources

Geocoding API
adresse.data.gouv.fr

Modules

beautifulsoup
ghost.py
selenium
scrapy
scrapoxy, python api

Websites

Notebooks

notebooks/_gs2a_eco_website

Readings

Python's Web Framework Benchmarks

Modules

bottle
django
falcon
Flask
sanic + uvloop

Jupyter and magic commands

Notebooks

notebooks/_gs2a_magic_commands

Big data without a cluster, unstructured data

presentation: structured data

Notebooks

notebooks/_gs2a_no_sql_exo notebooks/_gs2a_no_sql_twitter notebooks/_gs2a_big_in_memory

Readings

Database properties: ACID, relational, transactional
Best practices, indexes and foreign keys (importance of random access versus sequential access)
Limits of relational structures (tree-shaped data, heterogeneous data)
Non-relational databases, including NoSQL
l-td25asynthese
Un tools d'itertour, ou l'inverse
Benchmark of Python JSON libraries

NoSQL databases

MongoDB
rethinkdb (python : rethinkdb)

Modules

dask
cytoolz

Tensors, multidimensional arrays

(coming soon)

Modules

xarray
xtensor-array
cubes

C++, R

Notebooks

notebooks/_gs2a_langages notebooks/_gs1a_D_calcul_dicho_cython

Readings

l-python_cplusplus
sklearn-compiledtrees: generates a C++ implementation of the decision function of a decision tree trained with scikit-learn

Videos

Making your code faster: Cython and parallel processing in the Jupyter Notebook

Modules

cython
ctypes
boost_python
pybind11

Parallelization, serialization

Serialization means converting any data structure into an array of bytes; it is essential for communication between two machines or two processes.
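As a minimal sketch of this definition, the standard library can turn a Python structure into bytes and back in two ways. The `record` dictionary is a made-up example; `pickle` and `json` are only two of the possible formats.

```python
import json
import pickle

# Hypothetical data structure used for the illustration.
record = {"name": "ENSAE", "sessions": 6, "hours": [4, 4, 4, 4, 4, 4]}

# pickle: Python-specific binary format, handles most Python objects.
as_bytes = pickle.dumps(record)
restored = pickle.loads(as_bytes)

# json: text format readable from other languages, encoded here as UTF-8 bytes.
as_json_bytes = json.dumps(record).encode("utf-8")
restored_json = json.loads(as_json_bytes.decode("utf-8"))

print(type(as_bytes).__name__, type(as_json_bytes).__name__)
print(restored == record, restored_json == record)
```

The byte arrays can be written to a file or sent over a socket and rebuilt on the other side. Note that `pickle.loads` can execute arbitrary code, so it should only be used on data from a trusted source.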

Notebooks

notebooks/_gs2a_parallelisation notebooks/_gs2a_serialisation

Modules

dask
cytoolz
joblib

Readings

Out-of-Core Dataframes in Python: Dask and OpenStreetMap (2015/12)
Combining random forest models in scikit learn
Better Python compressed persistence in joblib

Algorithmic puzzles

td_2a_algo specials/nb_complet specials/algorithm_culture specials/problem_solved

Notebooks

notebooks/_gs2a_puzzle

Some are taken from various websites, including Google Code Jam.

Readings

Profiling with Python
types of complexity: brute force, greedy, dynamic programming
l-algoculture
l-expose-explication
Logique, modèles, calculs (INF 423)
Landau notation
Edmonds' Blossom Algorithm (github), Blossom5, Fast and Simple Algorithms for Weighted Perfect Matching
La recherche mathématique en mots et en images (CNRS)
The Traveling Salesperson Problem
Google Interview University: This is my multi-month study plan for going from web developer (self-taught, no CS degree) to Google software engineer.
Cache replacement policies
Technical books in French

Streaming algorithms

Splitting train / test in streaming

notebooks/_gs2a_streaming
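One common way to split a stream into train and test sets without storing it is to decide each record's side from a hash of its key. The sketch below illustrates that idea under stated assumptions (records carry a unique `id`, a 20% test ratio); it is not necessarily the approach taken in the notebook, and the function names are hypothetical.

```python
import hashlib


def assign_split(key, test_ratio=0.2):
    """Decide 'train' or 'test' for one record from its key alone."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2 ** 64
    return "test" if u < test_ratio else "train"


def split_stream(records, key_of, test_ratio=0.2):
    """Yield (split, record) pairs one at a time, without storing the stream."""
    for record in records:
        yield assign_split(key_of(record), test_ratio), record


# Hypothetical stream of records produced lazily by a generator.
stream = ({"id": i, "value": i * i} for i in range(10_000))
counts = {"train": 0, "test": 0}
for split, record in split_stream(stream, key_of=lambda r: r["id"]):
    counts[split] += 1
print(counts)
```

Because the decision depends only on the key, the same record always lands on the same side, even across runs or machines, and memory use stays constant. The observed test proportion is only approximately equal to `test_ratio`.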

Readings

BJKST algorithm
Streaming Algorithms
Data Stream Algorithms
Optimal streaming histograms
Density Estimation Over Data Stream
Data Streaming Algorithms
Confidence Decision Trees via Online and Active Learning for Streaming (BIG) Data
Approximation and Streaming Algorithms for Histogram Construction Problems
State-of-the-art on clustering data streams
Parallel Computing of Kernel Density Estimates with MPI
Density Estimation with Adaptive Sparse Grids for Large Data Sets

Modules

StreamLib

Data Scientist at large

Contrary to popular belief, data scientists are more predictable than data.

machine learning

l-debutermlprojet
Under the hood of the black box
Quick samples on machine learning
Cheat Sheets
Large volumes and sqlite3
What was True False Positive again?

what else?

Gerry Mandering (tinkering with electoral maps)
Learning synonyms
Review of Kaggle competitions

installation

Anaconda + conda update --all + pip install jyquickhelper
XGBoost on Windows

Bibliography

Books on machine learning

The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani, Jerome Friedman
Python for Data Analysis, Wes McKinney
Building Machine Learning Systems with Python, Willi Richert, Luis Pedro Coelho
Learning scikit-learn: Machine Learning in Python, Raúl Garreta, Guillermo Moncecchi
Modeling Creativity: Case Studies in Python, Tom De Smedt
Critical Mass: How One Thing Leads to Another, Philip Ball
Bugra Akyildiz
Deep Learning, Yoshua Bengio, Ian Goodfellow and Aaron Courville
Artificial Intelligence: A Modern Approach, Stuart Russell, Peter Norvig (2016/08)
Speech and Language Processing, Daniel Jurafsky and James H. Martin (2016/08), see also Draft chapters in progress

Books on algorithms

Introduction to Algorithms, Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein
The Algorithm Design Manual, Steven S. Skiena
Competitive Programming, Steven Halim

Books on programming

High Performance Python, Micha Gorelick, Ian Ozsvald.

The book is very well designed and the examples are very clear. If you want to speed up a Python program using multithreading, OpenMP, Numba, Cython, PyPy, or CPython, I recommend having a look at it first.

Links on programming

Python Scientific Lecture Notes
Introduction to matplotlib
Introduction to Data Processing with Python
A few book suggestions: Python for Data Scientists
Ultimate guide for Data Exploration in Python using NumPy, Matplotlib and Pandas
Don't use Hadoop - your data isn't that big
Prédire les épidémies avec Wikipedia, Le Monde
FastML (machine learning blog)
Mathematical optimization: finding minima of functions
you can take the derivative of a regular expression?! (2016/06)
How to trick a neural network into thinking a panda is a vulture (2016/06)
Matrix Factorization: A Simple Tutorial and Implementation in Python (2016/06)
Top-down learning path: Machine Learning for Software Engineers

Tutorials

PyData Seattle 2015 Scikit-learn Tutorial (2015/12)
Pythonic Perambulations (2015/12)
Python Scripts posted on Kaggle (2016/02)
Pandas cookbook (2016/06)
Machine Learning & Deep Learning Tutorials (2016/06): a link to a fairly long list of tutorials, which also includes cheat sheets such as Probability Cheatsheet

MOOC

Machine Learning by Andrew Y. Ng (chapters X and XI of week 6 cover how to build a machine learning system).
Coursera Machine Learning
Coursera Machine Algorithm
CSE373 - Analysis of Algorithms - 2007 SBU
CS109 Data Science (Harvard) (the list of available videos is at the bottom)

Other courses, notebooks

CS109 Data Science (Harvard) -TD -Talks
Notes and assignments for Stanford CS class CS231n Convolutional Neural Networks for Visual Recognition
Advanced Statistical Computing, Chris Fonnesbeck (Vanderbilt University)
CS 188: Artificial Intelligence (Berkeley)
IAPR: Teaching materials for machine learning
Machine learning and music: Audio Content Analysis, teachings
ogrisel's notebook (2016/04)
L'apprentissage profond (deep learning), Yann LeCun at the Collège de France (2016/06)
MA 2823 Foundations of Machine Learning (Fall 2016) (2016/10)

Papers by well-known authors

Latent Dirichlet Allocation, David M. Blei, Andrew Y. Ng, Michael I. Jordan
Analysis of a Random Forests Model, Gérard Biau
Adaptivity of Averaged Stochastic Gradient Descent to Local Strong Convexity for Logistic Regression, Francis Bach
Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising, Léon Bottou, Jonas Peters et al.
Tutorial on Practical Prediction Theory for Classification, John Langford
Sparse Online Learning via Truncated Gradient, John Langford, Lihong Li, Tong Zhang
Low-dimensional Embeddings for Interpretable Anchor-based Topic Inference, Moontae Lee, David Mimno
ABC model choice via random forests, Pierre Pudlo, Jean-Michel Marin, Arnaud Estoup, Jean-Marie Cornuet, Mathieu Gautier, Christian P. Robert
Mondrian Forests: Efficient Online Random Forests, Balaji Lakshminarayanan, Daniel M. Roy, Yee Whye Teh
Stochastic Gradient Tricks
SiGMa: Simple Greedy Matching for Aligning Large Knowledge Bases, Simon Lacoste-Julien, Konstantina Palla, Alex Davies, Gjergji Kasneci, Thore Graepel, Zoubin Ghahramani
Learning from Partial Labels, Timothee Cour, Benjamin Sapp, Ben Taskar
Word Alignment via Quadratic Assignment, Simon Lacoste-Julien, Ben Taskar, Dan Klein, Michael I. Jordan
Contextual Bandit Learning with Predictable Rewards, Alekh Agarwal, Miroslav Dudík, Satyen Kale, John Langford, Robert E. Schapire
Learning from Logged Implicit Exploration Data, Alex Strehl, John Langford, Lihong Li, Sham M. Kakade
The Metropolis-Hastings algorithm, Christian P. Robert
From RankNet to LambdaRank to LambdaMART: An Overview, Christopher J.C. Burges

Coding competitions

Google Hash Code: held every year in two rounds; the second round takes place at Google in Paris.
Google Code Jam
TopCoder
UVa Online Judge
The eight queens problem
Project Euler

Finally, Choosing the right estimator.

Python libraries

Simple/limited/incomplete benchmark for scalability, speed and accuracy of machine learning libraries for classification
Python extensions to do machine learning
Related Projects (of machine learning) (2016/03)

Machine learning libraries

Awesome Machine Learning
CNTK (2016/04)
Keras
pytorch
scikit-learn
TensorFlow
theano
Vowpal Wabbit
xgboost

Videos

Beyond Bag of Words: A Practitioner’s Guide to Advanced NLP
Building Continuous Learning Systems

¹ Contributor, instructor, and course coordinator.
² Contributor, instructor, and course coordinator.
³ Contributor, instructor.
⁴ Contributor, instructor.
⁵ Contributor, instructor.
⁶ Contributor, instructor.
⁷ Contributor, instructor in the early days (2014-2016).
⁸ Contributor.
