<a href="https://colab.research.google.com/github/yavuzuzun/machine-learning-practice/blob/main/Comparison_of_Dimensionality_Reduction_Techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Most analysis techniques become effective only after a suitable dimensionality reduction method has been chosen. It is therefore important to compare the strengths and weaknesses of the different dimensionality reduction algorithms.

## PCA

Best for numerical data. PCA can be expected to reduce dimensionality successfully when the following assumptions hold:

Numeric Data: PCA is designed for numeric data, such as real numbers or integers. It works well with continuous variables, making it suitable for applications like image processing, signal processing, and numerical datasets.

Linear Relationships: PCA assumes that the data exhibits linear relationships between variables. It seeks to find orthogonal (uncorrelated) linear combinations of the original variables, known as principal components. If the relationships in the data are nonlinear, other dimensionality reduction methods like t-SNE or Isomap may be more appropriate.

Comparable Scales: PCA ranks directions by variance, so it implicitly assumes that the variance of each feature is meaningful and measured on a comparable scale. A feature whose spread is much larger than the others only because of its units will dominate the leading components. If the features are on very different scales, standardize them first (for example to zero mean and unit variance), or consider other dimensionality reduction methods.

Large Feature Set: PCA is particularly useful when dealing with datasets with a large number of features (high-dimensional data). It helps in reducing the dimensionality while preserving most of the data's variance, making it easier to visualize and analyze.

Noise Tolerance: PCA can reduce the impact of noise in high-dimensional data by keeping the directions with the most variance and discarding the rest. However, if the noise is substantial, and in particular if it carries as much variance as the signal, the leading components may capture noise and the performance of PCA degrades.

Linearity: PCA works well when the underlying data can be well-represented as a linear combination of its features. If the data has complex, nonlinear structures, other dimensionality reduction techniques like manifold learning algorithms may be more suitable.
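To make the assumptions above concrete, here is a minimal sketch of PCA computed with a singular value decomposition in NumPy. It is not the implementation used later in this notebook. The helper name `pca_svd` and the synthetic data (two features driven by one latent variable, plus one unrelated feature on a much larger scale) are assumptions made only for illustration.

```python
import numpy as np

def pca_svd(X, n_components):
    """Illustrative PCA: return scores, principal directions, explained-variance ratios."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the orthogonal principal directions of the centered data.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance = singular_values**2 / (X.shape[0] - 1)
    explained_ratio = explained_variance / explained_variance.sum()
    components = Vt[:n_components]
    scores = X_centered @ components.T  # projection onto the kept directions
    return scores, components, explained_ratio[:n_components]

# Synthetic data (an assumption for this sketch, not the genomic data).
rng = np.random.default_rng(0)
n_samples = 500
latent = rng.normal(size=n_samples)
X = np.column_stack([
    latent + 0.1 * rng.normal(size=n_samples),        # linear in the latent variable
    2.0 * latent + 0.1 * rng.normal(size=n_samples),  # linear in the latent variable
    100.0 * rng.normal(size=n_samples),               # unrelated, but on a large scale
])

# Without scaling, the large-scale feature dominates the first component.
scores_raw, components_raw, ratio_raw = pca_svd(X, n_components=2)

# After standardizing, the first component follows the two correlated features.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
scores_std, components_std, ratio_std = pca_svd(X_std, n_components=2)

print("raw first component:         ", np.round(components_raw[0], 3))
print("standardized first component:", np.round(components_std[0], 3))
```

In this sketch the raw first component points almost entirely along the third feature, while the standardized first component is shared between the two linearly related features. This is why feature scale matters before applying PCA.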

The implementation below uses genomic data and follows this source: https://www.toptal.com/python/comprehensive-introduction-your-genome-scipy

In [None]:
!wget ftp://ftp.ensembl.org/pub/release-85/gff3/homo_sapiens/Homo_sapiens.GRCh38.85.gff3.gz
##gff-version   3
##sequence-region   1 1 248956422
##sequence-region   10 1 133797422
##sequence-region   11 1 135086622
##sequence-region   12 1 133275309
...
##sequence-region   MT 1 16569
##sequence-region   X 1 156040895
##sequence-region   Y 2781480 56887902
#!genome-build  GRCh38.p7
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.22
#!genebuild-last-updated 2016-06

In [None]:
import pandas as pd
pd.__version__

'1.5.3'

In [None]:
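# The nine standard GFF3 columns; the file itself has no header row.
col_names = ['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes']
# Read the gzipped, tab-separated file. Text from '#' onward is treated as a comment,
# which skips the '##' and '#!' header lines at the top of the file.
df = pd.read_csv('Homo_sapiens.GRCh38.85.gff3.gz', compression='gzip',
                 sep='\t', comment='#', low_memory=False,
                 header=None, names=col_names)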
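The same `read_csv` options can be tried without downloading the full file. The sketch below parses a tiny in-memory sample. The records are taken from the `df.head()` output in this notebook, with the attributes of the first record shortened. The names `sample_gff3` and `sample_df` are hypothetical and exist only for this illustration.

```python
import io
import pandas as pd

# Tiny GFF3-style sample: two header lines followed by three tab-separated records.
sample_gff3 = (
    "##gff-version   3\n"
    "##sequence-region   1 1 248956422\n"
    "1\tGRCh38\tchromosome\t1\t248956422\t.\t.\t.\tID=chromosome:1\n"
    "1\t.\tbiological_region\t10650\t10657\t0.999\t+\t.\tlogic_name=eponine\n"
    "1\t.\tbiological_region\t10655\t10657\t0.999\t-\t.\tlogic_name=eponine\n"
)

col_names = ['seqid', 'source', 'type', 'start', 'end',
             'score', 'strand', 'phase', 'attributes']

# comment='#' drops the header lines; header=None with names= supplies the column labels.
sample_df = pd.read_csv(io.StringIO(sample_gff3), sep='\t', comment='#',
                        header=None, names=col_names)

print(sample_df.shape)
print(sample_df[['type', 'start', 'end']])
```

Only the three data records remain after parsing, because the header lines begin with `#`.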

In [None]:
df.head()

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,attributes
0,1,GRCh38,chromosome,1,248956422,.,.,.,"ID=chromosome:1;Alias=CM000663.2,chr1,NC_00000..."
1,1,.,biological_region,10469,11240,1.3e+03,.,.,external_name=oe %3D 0.79;logic_name=cpg
2,1,.,biological_region,10650,10657,0.999,+,.,logic_name=eponine
3,1,.,biological_region,10655,10657,0.999,-,.,logic_name=eponine
4,1,.,biological_region,10678,10687,0.999,+,.,logic_name=eponine


In [None]:
df.info()

In [None]:
df.seqid.unique()  # df['seqid'].unique() is an equivalent alternative

In [None]:
df.seqid.unique().shape

In [None]:
df.source.value_counts()

havana            1441093
ensembl_havana     745065
ensembl            228212
.                  182510
mirbase              4701
GRCh38                194
insdc                  74
Name: source, dtype: int64

In [None]:
gdf = df[df.source == 'GRCh38'] # Choose the source GRCh38
gdf.shape

(194, 9)

In [None]:
gdf.sample(4)

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,attributes
2511457,KI270305.1,GRCh38,supercontig,1,1472,.,.,.,ID=supercontig:KI270305.1;Alias=chrUn_KI270305...
2511783,KI270712.1,GRCh38,supercontig,1,176043,.,.,.,ID=supercontig:KI270712.1;Alias=chr1_KI270712v...
2512198,KI270716.1,GRCh38,supercontig,1,153799,.,.,.,ID=supercontig:KI270716.1;Alias=chr2_KI270716v...
2511483,KI270375.1,GRCh38,supercontig,1,2378,.,.,.,ID=supercontig:KI270375.1;Alias=chrUn_KI270375...
