# Data Science Tutorial

### Prerequisites

1. Working knowledge of Python and the ability to write code in it
2. Familiarity with data and the various data types in Python
3. Basic knowledge of statistics

### Course Outline

#### Importing data, Data Cleaning and Descriptive Analytics

##### Importing data

1. Read data from the local machine
   - 1.1 Read from a CSV file
   - 1.2 Read from an Excel file
   - 1.3 Read from a text file
   - 1.4 Read from other data files (HTML, JSON)

2. Connect to a server
   - 2.1 Connecting to a SQL server
   - 2.2 Connecting to Amazon cloud services (S3, Redshift)
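
The import steps above can be sketched as follows, assuming pandas is the loading library (the outline does not name one). The in-memory strings and the in-memory SQLite database are stand-ins so the example runs anywhere. With real data you would pass a file path or a connection to your own server instead. S3 and Redshift need credentials and extra packages, so they are not shown.

```python
import io
import sqlite3

import pandas as pd

# 1. Local files. In-memory strings stand in for files on disk (hypothetical contents).
csv_text = "id,city,amount\n1,Pune,10.5\n2,Delhi,7.0\n"
tsv_text = "id\tcity\tamount\n3\tMumbai\t4.0\n"
json_text = '[{"id": 4, "city": "Pune", "amount": 3.2}]'

df_csv = pd.read_csv(io.StringIO(csv_text))            # 1.1 CSV
df_txt = pd.read_csv(io.StringIO(tsv_text), sep="\t")  # 1.3 tab-delimited text
df_json = pd.read_json(io.StringIO(json_text))         # 1.4 JSON
# 1.2 Excel: pd.read_excel(path) follows the same pattern but needs an Excel
# engine such as openpyxl installed, so it is not executed here.

# 2. SQL. An in-memory SQLite database stands in for a real SQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "a", 10.0), (2, "b", 25.5), (3, "a", 4.5)],
)
conn.commit()

# Aggregate inside the database so only the summary rows are transferred.
query = (
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY customer"
)
totals = pd.read_sql_query(query, conn)
conn.close()

print(df_csv.shape, df_txt.shape, df_json.shape)
print(totals)
```

Pushing the `GROUP BY` into the query is a design choice: for large tables it is usually cheaper than pulling every row into pandas first.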


#### Descriptive Analytics

Descriptive analytics shows what the data looks like: the number of rows and columns, the data type of each column, and basic statistical parameters such as the mean and standard deviation.

What?
1. Summary of the data
2. Checking for NULL values in data
3. Checking for outliers in Data
4. Checking for skewness
5. Checking for class imbalance
6. Checking for distribution of various categorical columns

How?
1. Using various Python functions
2. Using packages
   - 2.1 pandas-profiling
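
A minimal sketch of the checks listed above, assuming pandas and a small made-up dataset (the column names `age`, `plan` and `churned` are hypothetical). The 1.5 × IQR rule used for outliers is one common convention, not the only option.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value, one outlier and an imbalanced label.
df = pd.DataFrame({
    "age": [22, 25, 27, 30, 31, 33, 35, np.nan, 29, 95],
    "plan": ["basic"] * 7 + ["pro"] * 3,
    "churned": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

print(df.shape)       # number of rows and columns
print(df.dtypes)      # data type of each column
print(df.describe())  # count, mean, std and quartiles of numeric columns

null_counts = df.isna().sum()                              # NULL values per column
skewness = df["age"].skew()                                # > 0: long right tail
class_share = df["churned"].value_counts(normalize=True)   # class imbalance
plan_counts = df["plan"].value_counts()                    # categorical distribution

# Outliers: values more than 1.5 * IQR outside the quartiles.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
print(outliers)
```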

#### Data Visualization

What?
1. Plotting basic curves
2. Looking at the distribution
3. Checking for relations and patterns in data

What else?
1. Line Chart
2. Bar and Column Chart
3. Histogram
4. Pie Chart
5. Scatter Plot
6. Box Plot
7. Other charts (area chart, dendrogram, radar chart, heat map, bubble chart, 3-D plots)

References: https://python-graph-gallery.com/all-charts/

How?

Libraries:
1. matplotlib
2. seaborn
3. ggplot
4. plotly

Tableau is also a good tool to explore.

References: https://mode.com/blog/python-data-visualization-libraries/    
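
As an illustration of three of the chart types listed above (histogram, scatter plot, box plot), here is a minimal matplotlib sketch. The data is randomly generated for the example, and the figure is written to an in-memory buffer rather than a file. In a notebook you would drop the `matplotlib.use("Agg")` line and simply display the figure.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

axes[0].hist(x, bins=20)          # distribution of a single variable
axes[0].set_title("Histogram")

axes[1].scatter(x, y, s=8)        # relation between two variables
axes[1].set_title("Scatter plot")

axes[2].boxplot([x, y])           # spread and outliers of each variable
axes[2].set_xticks([1, 2])
axes[2].set_xticklabels(["x", "y"])
axes[2].set_title("Box plot")

fig.tight_layout()
buffer = io.BytesIO()
fig.savefig(buffer, format="png")  # use a file name here to save to disk
```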

#### Data Cleaning

What?
1. Treating/Imputing missing values
2. Treating/Removing Outliers
3. Removing garbage values
4. Removing unnecessary columns
5. Fixing data types
6. Cleaning entire column element-wise using .apply family
7. Data cleaning of values in column (removing leading zeros, dropping special characters)
8. Cleaning dates and reformatting them
9. Converting the string "nan" into real NaN values
10. Conditional cleaning using np.where
11. Renaming columns
12. Capitalizing and lowercasing strings
13. Dropping rows/columns with missing values
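
Items 1 to 5 and 13 can be sketched in a few lines of pandas. The dataset and column names are made up for the example, and median imputation and the 1.5 × IQR rule are assumptions chosen for illustration; the right treatment depends on the data.

```python
import pandas as pd

# Hypothetical raw data: numbers stored as text, a garbage value, missing cells.
raw = pd.DataFrame({
    "row_id": [1, 2, 3, 4, 5, 6],
    "price": ["10", "12", "n/a", "11", "13", "500"],
    "city": ["Pune", None, "Delhi", "Pune", "Delhi", "Pune"],
})

df = raw.drop(columns=["row_id"])                         # item 4: unnecessary column
# Items 3 and 5: fix the data type; unparseable garbage becomes NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["price"] = df["price"].fillna(df["price"].median())    # item 1: impute with the median
df = df.dropna(subset=["city"])                           # item 13: drop rows without a city

# Item 2: remove outliers that fall outside 1.5 * IQR of the quartiles.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)].reset_index(drop=True)

print(df)
```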

References:
- https://realpython.com/python-data-cleaning-numpy-pandas/
- https://towardsdatascience.com/data-cleaning-in-python-the-ultimate-guide-2020-c63b88bf0a0d
- https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b
- https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3

How?
We will go through each of the above in detail below
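
As a first taste, here is a minimal sketch of the string, date and conditional cleaning steps (items 6 to 12). All column names and values are hypothetical, and the spend threshold of 100 is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Cust ID": ["00123", "00045", "7"],
    "Name": ["  alice ", "BOB", "nan"],
    "Phone#": ["(020) 555-0101", "020.555.0102", "0205550103"],
    "Joined": ["2020-01-15", "2020-02-03", "not a date"],
    "Spend": [120.0, 15.0, 60.0],
})

# Item 11: rename columns to consistent snake_case names.
df = df.rename(columns={
    "Cust ID": "cust_id", "Name": "name", "Phone#": "phone",
    "Joined": "joined", "Spend": "spend",
})

# Item 7: remove leading zeros and drop every non-digit character.
df["cust_id"] = df["cust_id"].str.lstrip("0")
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)

# Item 9: turn the literal string "nan" into a real missing value.
df["name"] = df["name"].mask(df["name"] == "nan")

# Items 6 and 12: clean element-wise with .apply, then fix the capitalization.
df["name"] = df["name"].apply(
    lambda s: s.strip().title() if isinstance(s, str) else s
)

# Item 8: parse dates (invalid ones become NaT) and reformat them.
df["joined"] = pd.to_datetime(df["joined"], format="%Y-%m-%d", errors="coerce")
df["joined_fmt"] = df["joined"].dt.strftime("%d/%m/%Y")

# Item 10: conditional cleaning with np.where.
df["segment"] = np.where(df["spend"] >= 100, "high", "regular")

print(df)
```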

After cleaning, you can visualize the data again to make sure it is ready for further analysis. I prefer doing this with pandas-profiling, although it can take a long time on a large dataset.

pandas-profiling: https://github.com/pandas-profiling/pandas-profiling
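
One way to keep profiling time manageable on a large dataset is to profile a random sample instead of the full table. The helper below is a hypothetical sketch (the name `sample_for_profiling` and the 10,000-row default are assumptions, not part of pandas-profiling). The profiler call itself is left as a comment so the example runs without the package installed.

```python
import numpy as np
import pandas as pd

def sample_for_profiling(df, max_rows=10_000, seed=0):
    """Return df unchanged if it is small, else a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)

# Synthetic "large" dataset for the example.
big = pd.DataFrame({"value": np.arange(50_000), "group": np.arange(50_000) % 5})
small = sample_for_profiling(big, max_rows=5_000)
print(len(big), "->", len(small))

# The sample can then be handed to the profiler, for example:
#   from pandas_profiling import ProfileReport
#   ProfileReport(small, minimal=True).to_file("report.html")
```

A random sample preserves overall distributions reasonably well, but rare categories and extreme outliers may be missed, so treat the report as a quick check rather than a complete audit.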

### Machine Learning
Unsupervised Learning

Supervised Learning

### Unsupervised Learning
Unsupervised learning is a type of machine learning that looks for previously undetected patterns in a data set with no pre-existing labels and with a minimum of human supervision. In contrast to supervised learning, which usually makes use of human-labeled data, unsupervised learning (also known as self-organization) allows for modeling of probability densities over inputs.

##### Applications of unsupervised machine learning
- Clustering automatically splits the dataset into groups based on their similarities.
- Anomaly detection discovers unusual data points in your dataset. It is useful for finding fraudulent transactions.
- Association mining identifies sets of items that often occur together in your dataset.
- Latent variable models are widely used for data preprocessing, for example to reduce the number of features in a dataset or to decompose the dataset into multiple components.

##### Disadvantages of Unsupervised Learning
- You cannot get precise information about how the data is sorted, because the data used in unsupervised learning is unlabeled and the output is not known in advance.
- Results tend to be less accurate because the input data is not labeled by people in advance, so the machine has to find the structure itself.
- In image classification, the spectral classes found by the algorithm do not always correspond to informational classes.
- The user needs to spend time interpreting and labeling the classes that result from the classification.
- Spectral properties of classes can also change over time, so the same class information cannot be carried over from one image to another.

References:
- https://www.guru99.com/unsupervised-machine-learning.html
- https://en.wikipedia.org/wiki/Unsupervised_learning
- https://towardsdatascience.com/unsupervised-learning-and-data-clustering-eeecb78b422a
            

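1. Clustering
   - 1.1 k-means
   - 1.2 Hierarchical clustering
   - 1.3 K-NN (k nearest neighbors); note that k-NN is usually treated as a supervised method
   - 1.4 Principal Component Analysis; a dimensionality-reduction technique rather than a clustering algorithm

2. Anomaly Detection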
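
To make the clustering idea concrete, here is a minimal k-means sketch written with NumPy only. It is a teaching version of the standard assign-then-update loop (Lloyd's algorithm) with random initialization, run on synthetic two-blob data; a library implementation would normally be used in practice.

```python
import numpy as np

def assign(X, centers):
    """Index of the nearest center (Euclidean distance) for every point."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(X, centers)
        # Move each center to the mean of its points; keep it if the cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # converged: centers stopped moving
        centers = new_centers
    return assign(X, centers), centers

# Synthetic data: two well-separated blobs around (0, 0) and (5, 5).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])
labels, centers = kmeans(X, k=2)
print(centers)
```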

References:  
https://algorithmia.com/blog/introduction-to-unsupervised-learning
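
Principal Component Analysis from the list above can be sketched with a singular value decomposition in NumPy. The data here is synthetic: three features, where the second is almost a multiple of the first, so a single component should capture nearly all of the variance.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components using SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return X_centered @ components.T, components, explained_ratio[:n_components]

# Synthetic data with one dominant direction, roughly along (1, 2, 0).
rng = np.random.default_rng(0)
t = rng.normal(size=300)
X = np.column_stack([
    t,
    2 * t + rng.normal(scale=0.1, size=300),
    rng.normal(scale=0.1, size=300),
])

Z, components, ratio = pca(X, n_components=1)
print(Z.shape, ratio)
```

The sign of a principal direction is arbitrary, so compare directions by absolute value when checking results.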

### Supervised Learning

Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way.  
  
In order to solve a given problem of supervised learning, one has to perform the following steps:  

- Determine the type of training examples. Before doing anything else, the user should decide what kind of data is to be used as a training set.  
- Gather a training set. The training set needs to be representative of the real-world use of the function. Thus, a set of input objects is gathered and corresponding outputs are also gathered, either from human experts or from measurements.  
- Determine the input feature representation of the learned function. The accuracy of the learned function depends strongly on how the input object is represented. Typically, the input object is transformed into a feature vector, which contains a number of features that are descriptive of the object. The number of features should not be too large, because of the curse of dimensionality; but should contain enough information to accurately predict the output.  
- Determine the structure of the learned function and corresponding learning algorithm.  
- Complete the design. Run the learning algorithm on the gathered training set. Some supervised learning algorithms require the user to determine certain control parameters. These parameters may be adjusted by optimizing performance on a subset (called a validation set) of the training set, or via cross-validation.  
- Evaluate the accuracy of the learned function. After parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set.
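
The steps above can be sketched end to end with NumPy. This is an illustration under stated assumptions: the data is simulated, ridge regression is used as a simple stand-in for the learning algorithm, the candidate `alpha` values are arbitrary, and no intercept is fitted because the simulated data is centered at zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gather a training set (simulated here) with each input as a feature vector.
n, d = 300, 5
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.0, 0.5, 0.0])  # hypothetical ground truth
y = X @ true_w + rng.normal(scale=0.3, size=n)

# Split into training, validation and test sets (60/20/20).
idx = rng.permutation(n)
train, val, test = idx[:180], idx[180:240], idx[240:]

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression: solve (X^T X + alpha I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def mse(X, y, w):
    """Mean squared error of the linear model w on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))

# Tune the control parameter alpha on the validation set.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
val_errors = {
    a: mse(X[val], y[val], fit_ridge(X[train], y[train], a)) for a in alphas
}
best_alpha = min(val_errors, key=val_errors.get)

# Evaluate once on the held-out test set.
w = fit_ridge(X[train], y[train], best_alpha)
test_mse = mse(X[test], y[test], w)
print(best_alpha, round(test_mse, 3))
```

The test set is touched only once, after `alpha` has been chosen, so the reported error is not biased by the tuning step.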

1. Linear regression 
2. Logistic regression
3. Support Vector Machines
4. Naive Bayes
5. Decision trees
6. Random Forest
7. Neural Network
  
The hyperparameters of each model are then tuned for the individual dataset.
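
As a hedged illustration of tuning a hyperparameter for a specific dataset, the sketch below uses 5-fold cross-validation to pick `k` for a small k-nearest-neighbors classifier written in NumPy. The classifier, the synthetic two-class data and the candidate values of `k` are all assumptions made for the example; the same loop applies to the hyperparameters of any model in the list above.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k):
    """Majority vote among the k nearest training points (labels are 0 or 1)."""
    dists = np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]
    return (votes.mean(axis=1) > 0.5).astype(int)

def cross_val_accuracy(X, y, k, n_folds=5, seed=0):
    """Mean validation accuracy of k-NN over n_folds shuffled folds."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, n_folds)
    scores = []
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        pred = knn_predict(X[train_idx], y[train_idx], X[val_idx], k)
        scores.append(np.mean(pred == y[val_idx]))
    return float(np.mean(scores))

# Synthetic two-class data: two overlapping Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 2)),
    rng.normal(2.5, 1.0, size=(100, 2)),
])
y = np.array([0] * 100 + [1] * 100)

candidates = [1, 3, 5, 7, 9, 11, 15]  # odd values avoid tied votes
cv_scores = {k: cross_val_accuracy(X, y, k) for k in candidates}
best_k = max(cv_scores, key=cv_scores.get)
print(best_k, cv_scores[best_k])
```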

### Further Learnings
1. Feature Engineering
2. Handling class imbalance in the data
3. Synthesizing the findings
4. Using external data to further enrich the existing dataset
5. ...
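
For item 2, one simple option is random oversampling of the minority class. The helper below is a hypothetical sketch in pandas (the function name and the toy data are assumptions). Other approaches such as undersampling or class weights may suit a given problem better.

```python
import pandas as pd

def random_oversample(df, label_col, seed=0):
    """Duplicate random minority-class rows until every class has equal size."""
    counts = df[label_col].value_counts()
    target = counts.max()
    parts = []
    for cls, n in counts.items():
        rows = df[df[label_col] == cls]
        if n < target:
            # Sample with replacement to fill the gap to the majority size.
            extra = rows.sample(n=int(target - n), replace=True, random_state=seed)
            rows = pd.concat([rows, extra])
        parts.append(rows)
    # Shuffle so the classes are not grouped together.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy data: 8 rows of class 0 and 2 rows of class 1.
df = pd.DataFrame({"feature": range(10), "label": [0] * 8 + [1] * 2})
balanced = random_oversample(df, "label")
print(balanced["label"].value_counts())
```

Oversampling should be applied to the training split only; duplicating rows before splitting leaks copies of the same example into the test set.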