<a href="https://colab.research.google.com/github/rahiakela/mlops-research-and-practice/blob/main/MLOps-Specialization/course-3-machine-learning-modeling-pipelines-in-production/week-2-model-resource-management-techniques/02_algorithmic_dimensionality_reduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Algorithmic Dimensionality Reduction

Welcome! In this ungraded lab you will apply several algorithms that reduce the dimensionality of data. This topic matters because it is not uncommon for models trained on reduced data to outperform those trained on the raw data, since most datasets contain noise and redundant information. Reducing dimensionality also lets your models train and make predictions faster, which can be important depending on the problem you are working on. In particular, you will:


1. Use Principal Component Analysis (**PCA**) to reduce the dimensionality of a dataset used to classify celestial bodies.
2. Use Singular Value Decomposition (**SVD**) to create low-dimensional representations of images of handwritten digits.
3. Use Non-negative Matrix Factorization (**NMF**) to segment text into topics.

Let's get started!

## Setup

In [None]:
# General use imports
import os
import zipfile
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

from sklearn.datasets import load_digits
from sklearn.datasets import fetch_20newsgroups

from mpl_toolkits.mplot3d import Axes3D

In [None]:
# Download zip file
!curl https://archive.ics.uci.edu/ml/machine-learning-databases/00372/HTRU2.zip -o HTRU2.zip

# Unzip it
data_folder = os.path.join('.', 'data')
with zipfile.ZipFile('HTRU2.zip', 'r') as zip_ref:
    zip_ref.extractall(data_folder)
    
# Delete the downloaded zip file
os.remove('HTRU2.zip')

In [None]:
os.listdir(data_folder)

## Principal Component Analysis - PCA

PCA is an unsupervised algorithm that creates linear combinations of the original features. It is widely used for dimensionality reduction because it is fast and easy to implement. PCA aims to keep as much of the original data's variance as possible in a lower-dimensional space: it finds the axes along which the variance of the projected data is maximized.
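To make the idea of a variance-maximizing projection concrete, here is a minimal sketch that uses only NumPy on synthetic data. The data, the variable names, and the use of a singular value decomposition of the centered data are assumptions made for illustration; this is not the code used later in this lab, which relies on scikit-learn's `PCA`.

```python
import numpy as np

# Hypothetical synthetic data: two strongly correlated features plus a little noise.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
data = np.column_stack([x, 2.0 * x + rng.normal(scale=0.1, size=500)])

# PCA works on mean-centered features.
centered = data - data.mean(axis=0)

# The rows of vt are the principal axes, ordered by decreasing variance.
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Variance captured along each principal axis, and its share of the total.
explained_variance = singular_values**2 / (len(data) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Project onto the first principal axis. Each projected value is a
# linear combination of the two original (centered) features.
projected = centered @ vt[0]

print(explained_variance_ratio)
```

Because the two synthetic features are almost perfectly correlated, nearly all of the variance lies along the first axis, so a single component is enough to describe this toy dataset.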

In the lecture you saw PCA applied to the Iris dataset. That dataset has been used extensively to showcase PCA, so here you will do something different: you will use the [HTRU_2](https://archive.ics.uci.edu/ml/datasets/HTRU2) dataset, which describes several celestial objects, with the goal of classifying whether an object is a pulsar star or not.


Load the data into a dataframe for easier inspection: