In [None]:
- Data preprocessing
    - Scaling, normalization, imputation
    - Load features into pandas dataframe
    - Pipeline (`sklearn.pipeline`)
        - Chains multiple preprocessing and machine learning steps together; the output of each step is the input of the next: `Pipeline(steps=[('step_name', technique), ...])`, where `step_name` is a meaningful string that refers to the step.
    - StandardScaler (`sklearn.preprocessing`)
        - Used for numeric features (discrete or continuous).
        - Technique that scales data so that each feature has a mean of 0 and standard deviation of 1.
        - Rescales each feature to have the same scale, making it easier to compare relative importance of different features.
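    - OneHotEncoder (`sklearn.preprocessing`)
        - Used to convert categorical (nominal) features into numerical features that can be read by algorithms. Ordinal features are usually better handled by `OrdinalEncoder`, which preserves the order of the categories.
        - Creates one binary (0/1) column for each category in the original feature.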
    - ColumnTransformer (`sklearn.compose`)
        - Applies different preprocessing steps to different columns or subsets of columns.
        - `ColumnTransformer(transformers=[('name', technique, ['col_el_1', 'col_el_2'])])`.
    - SelectKBest, f_regression (`sklearn.feature_selection`)
        - Selects top k features based on scoring function.
        - `selector = SelectKBest(score_func=f_regression, k=10)`
        - `selector.fit_transform(X, y)`
        - `k=10` keeps the 10 highest-scoring features.
        - Feature Selection
            - Using all features can lead to overfitting and reduced performance.
            - SelectKBest
                - Helps you focus on the most important patterns in the data.
                - Reduces overfitting, makes model less complex.
                - An overfit model fits the noise as well as the signal in the data (see kaggle notes here?).
                - Faster training times.
        - f_regression
            - Based on the F-test; returns an F-score and a p-value for each feature (see Stat Notes).
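
The preprocessing steps above can be combined into a single pipeline. The following is a minimal sketch, not a definitive recipe: the DataFrame, its column names (`age`, `income`, `city`), and the target `y` are hypothetical toy data invented for illustration, and `k=3` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one categorical column.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.integers(18, 70, size=50).astype(float),
    "income": rng.normal(50_000, 10_000, size=50),
    "city": rng.choice(["A", "B", "C"], size=50),
})
# Hypothetical regression target that depends on the numeric columns.
y = 0.5 * X["age"] + 0.001 * X["income"] + rng.normal(0, 1, size=50)

# Scale the numeric columns and one-hot encode the categorical column.
preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Each step's output is the next step's input.
pipe = Pipeline(steps=[
    ("preprocess", preprocess),
    ("select", SelectKBest(score_func=f_regression, k=3)),
])

X_selected = pipe.fit_transform(X, y)
print(X_selected.shape)  # (50, 3): the 3 highest-scoring of 5 preprocessed columns
```

Putting the selector inside the pipeline means it is fitted only on the data passed to `fit`, which helps avoid leaking information from held-out data.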

In [None]:
- Feature extraction
    - Principal Component Analysis, Linear Discriminant Analysis, Non-negative Matrix Factorization. Used to reduce dimensionality or extract meaningful features.
    - CountVectorizer (`sklearn.feature_extraction.text`)
        - `.fit_transform(array_of_strings)`
        - Outputs a document-term matrix: each row is one input string, each column is one word in the learned vocabulary, and each value is the number of times that word appears in that string. Pass `binary=True` for a 1/0 representation instead (1: appears, 0: doesn’t appear).
        - Corpus (plural: corpora): a large, structured set of texts, recordings, or other forms of linguistic data. Passed to the vectorizer for tokenization and counting.
    - PCA (`sklearn.decomposition`)
        - Reduces dimensionality by finding new features (principal components), each a linear combination of the original features, that explain the majority of the variance in the data.
        - Transforms the original dataset into a new coordinate system whose axes point in the directions of highest variance in the data.
        - The first component captures the highest variance; each subsequent component is perpendicular to the previous ones and captures the most of the remaining variance.
        - Uses Singular Value Decomposition of the centered data matrix, which is equivalent to finding the eigenvectors (directions of maximum variance) and eigenvalues (variance along each direction) of the covariance matrix.
            - Eigenvector: a nonzero vector that, when multiplied by a matrix, yields a scaled version of itself.
            - Eigenvalue: the scaling factor of the eigenvector.
        - Can be used for: dimensionality reduction, visualization, noise reduction
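        - With `n_components=10`, `transform` turns, for example, a matrix of 1000 rows and 50 cols into 1000 rows and 10 cols.
        - Columns of the output matrix have mean 0 on the training data because PCA centers the data, but they are not scaled to SD 1 unless `whiten=True`.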
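    - LDA (Linear Discriminant Analysis, `sklearn.discriminant_analysis`)
        - Finds linear combinations of features that best separate different classes of data.
        - Maximizes the ratio of between-class variance to within-class variance.
        - Results in a new set of variables that can be used to classify new data points.
        - Used in pattern recognition, image processing, and feature extraction.
        - Not to be confused with Latent Dirichlet Allocation (also abbreviated LDA, `sklearn.decomposition`), which is the technique used for topic modeling.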
    - NMF
        - Decomposes a non-negative matrix into the product of two smaller non-negative matrices.
        - Finds basis vectors (cols of matrix 1) and coefficients (rows of matrix 2) that can be combined to approximately reconstruct the original matrix.
        - Useful in text analysis.
        - Topic modeling and interpretability.

In [None]:
- Model selection
    - [ML Kaggle]
    - Cross-validation, grid search, random search.
    - Used to choose the model and hyperparameters expected to perform best when predicting or classifying new data points.
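
Cross-validation and grid search can be sketched as follows. This is an illustrative example only: the iris dataset, the `LogisticRegression` estimator, and the candidate values for `C` are assumptions chosen to keep the sketch short, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Cross-validation: fit and score the model on 5 different train/validation splits.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())

# Grid search: run cross-validation for every candidate value of C and keep the best.
grid = GridSearchCV(model, param_grid={"C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

Random search follows the same pattern with `RandomizedSearchCV`, which samples a fixed number of parameter combinations instead of trying all of them.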

In [None]:
- Model training
    - Linear regression, logistic regression, decision trees, random forests, support vector machines, neural networks.
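
Several of the model families listed above share the same `fit` / `predict` / `score` interface in scikit-learn. The sketch below assumes the iris dataset and mostly default hyperparameters purely for illustration, and it uses only the classifier variants.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "support vector machine": SVC(),
}

test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)                         # train on the training split
    test_accuracy[name] = model.score(X_test, y_test)   # accuracy on the held-out split
    print(name, test_accuracy[name])
```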

In [None]:
- Evaluation
    - …on performance of models, using metrics such as accuracy, precision, recall, F1 score, ROC curves, and confusion matrices.
    - Used to measure the effectiveness of a model and how it compares with other models.
    - confusion_matrix (`sklearn.metrics`)
        - Used to describe the performance of a classification model on test data on which the true values are known.
        - For binary classification, it is a 2x2 table that compares actual and predicted labels, giving the counts of true positives, true negatives, false positives, and false negatives.
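
A small worked example of the confusion matrix and the metrics derived from it is given below. The label lists `y_true` and `y_pred` are hypothetical values made up for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical true labels and model predictions for 8 samples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)              # [[3 1]
                       #  [1 3]]
print(tn, fp, fn, tp)  # 3 1 1 3

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total = 6 / 8
precision = precision_score(y_true, y_pred)  # tp / (tp + fp) = 3 / 4
recall = recall_score(y_true, y_pred)        # tp / (tp + fn) = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
print(accuracy, precision, recall, f1)       # 0.75 0.75 0.75 0.75
```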