![CS 4150](images/cs4150-title.png)

## **Course Description**
CS 4150 - Data Science - is an elective course in Ohio University's computer science program. Students who complete this course will gain a thorough understanding of algorithmic techniques and processes for data science by implementing them and by applying them to solve real-world problems. Students will gain not only an understanding of data science models and methods, but will also learn how to perform analyses that answer domain research questions and how to communicate the resulting insights effectively. Students will be able to clean, preprocess, and explore real-world data sets. Students will be able to implement algorithms that provide computational models for graph analytics and for clustering. Students will be able to characterize the statistical significance of their computational models. Finally, students will be able to visualize and interpret the significance of the computational models in authentic domains.

## **What You'll Learn**
Data science is the process of analyzing, visualizing, and working with a large dataset. In this class, the large dataset that you will be using contains data from mouse embryonic stem cells. It is important to review the following biological concepts so that you thoroughly understand what the dataset describes. Click on each video to learn more about these concepts:
- Tracking the human genome in 4D: https://www.youtube.com/watch?v=Q_KdrtsmYoE
- Chromatin: https://www.youtube.com/watch?v=p-khsHRDqeA
- DNA Structure: https://www.youtube.com/watch?v=8Ayp7ReOUG8 
- A 3D Map of the Human Genome: https://www.youtube.com/watch?v=dES-ozV65u4
- The nucleus: https://www.khanacademy.org/test-prep/mcat/cells/eukaryotic-cells/v/the-nucleus

To understand how researchers gathered the data in the dataset, please read the following article: https://www.nature.com/articles/nature21411

### **A Large Dataset**

The dataset is a text file called "cs4150dataset.txt", posted in the src folder.

Each column of the data is a nuclear profile (NP), which was obtained by taking a random slice from the nucleus of a single embryonic stem cell from a mouse. Each column represents a slice from a different cell.

The first row in the file contains the names of the NPs. Each NP name begins with the letter ‘F’. An example NP name is: F10A2. Each remaining row of the data represents a genomic window of 30,000 contiguous nucleotides (a.k.a. base pairs). The format of a row is as follows.

The first three columns of each row denote the coordinates of a genomic window.
The following example denotes a genomic window that begins at position 0 on chromosome 1 and stops at position 30000 on chromosome 1:
1.	Chromosome = chr1
2.	Start position on the chromosome = 0
3.	Stop position on the chromosome = 30000

Each cell (i, j) of the matrix contains either a ‘1’ or a ‘0’: <br>
- ‘1’: Genomic window i was present in NP j
- ‘0’: Genomic window i was not present in NP j
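
The row format described above can be illustrated on a tiny, made-up table. The sketch below is only one possible way to read such data: the three-window table, the NP names other than F10A2, and the header labels `chrom`, `start`, and `stop` are all assumptions for illustration and are not taken from the real file.

```python
import io

# Hypothetical miniature table in the format described above: a header row,
# then one row per genomic window (chromosome, start, stop, then one 0/1 per NP).
# The coordinate labels "chrom", "start", "stop" in the header are an assumption.
toy_table = io.StringIO(
    "chrom start stop F10A2 F10A3 F10A4\n"
    "chr1 0 30000 0 1 0\n"
    "chr1 30000 60000 1 1 0\n"
    "chr1 60000 90000 0 1 1\n"
)

header = toy_table.readline().split()
np_names = header[3:]  # skip the three coordinate columns

window_count = 0
windows_per_np = [0] * len(np_names)
for line in toy_table:
    fields = line.split()
    if not fields:
        continue  # ignore blank lines
    window_count += 1
    # fields[3:] holds the 0/1 presence flags, one per NP
    for j, flag in enumerate(fields[3:]):
        windows_per_np[j] += int(flag)

average = sum(windows_per_np) / len(np_names)
print("Window count:", window_count)
print("Number of NPs:", len(np_names))
print("Average windows per NP:", round(average, 2))
```

If the header of the real file does not include labels for the three coordinate columns, the slice `header[3:]` would need to be adjusted accordingly.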

In [None]:
# EXERCISE: 

# Write a program to read the file and compute the following:
#  - Number of genomic windows (ANSWER YOU SHOULD GET: 90877)
#  - Number of NPs (ANSWER YOU SHOULD GET: 408)
#  - On average, how many windows are present in an NP? (ANSWER YOU SHOULD GET: 5482.81)

# COLUMNS in data correspond to Nuclear Profiles (NP)
# ROWS in data correspond to Windows

data_file = "./src/cs4150dataset.txt"
all_data_list = []

window_count = 0
profile_count = 0
windows_in_np_average = 0

# Reading in input file ... 
with open(data_file) as f:
    for line in f:
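        fields = line.split()  # avoid shadowing the built-in name "list"
        all_data_list.append(fields)

# TODO: compute window_count, profile_count, and windows_in_np_average from all_data_list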


print("Window count: ", window_count)
print("Number of NPs: ", profile_count)
print("Average number of windows in an NP: ", windows_in_np_average)

### **Visualizing Data**
The Python library matplotlib contains many useful functions for displaying and visualizing data. Learn more about matplotlib's scatterplot, line graph, and heatmap functionality in the following examples:

#### **Scatterplots**
A scatterplot uses dots to plot data on a 2D graph, and should be used when comparing two variables. In the example below, the "x" variable contains a list of all x value coordinates, and the "y" variable contains a list of all y value coordinates. To plot the coordinates in a scatterplot, simply pass each list to the plt function, scatter(). Run the code to see the scatterplot.

In [None]:
# Example of a Scatterplot: 
%matplotlib inline
import matplotlib.pyplot as plt

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

plt.title("Plot of y list vs the x list")
plt.xlabel("X axis")
plt.ylabel("Y axis")
plt.scatter(x, y)
plt.show()

#### **Line Graphs**
A line graph is used when plotting a variable against time. In the following example, the unemployment rate is plotted over a 100-year period. The list of years is stored in "Year", and the rate is stored in "Unemployment_Rate". To plot a line graph, plt's plot() function is called, passing the x variable, y variable, color, and marker. Titles and axis labels can then be added. Run the code to see the result.

In [None]:
# Example of a Line graph: 
%matplotlib inline
import matplotlib.pyplot as plt
   
Year = [1920,1930,1940,1950,1960,1970,1980,1990,2000,2010,2020]
Unemployment_Rate = [9.8,12,8,7.2,6.9,7,6.5,6.2,5.5,6.3,9.5]
  
plt.plot(Year, Unemployment_Rate, color='red', marker='o')
plt.title('Unemployment Rate Vs Year', fontsize=14)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Unemployment Rate', fontsize=14)
plt.grid(True)
plt.show()

#### **Heatmaps**
A heatmap is a graphical representation of data that uses a system of color coding to represent a dataset's values. In the example below, random numbers between 0 and 1 are plotted in a 16 by 16 matrix. A heatmap is generated using plt's imshow() function, passing the matrix, color scheme, and interpolation. In this case, the reversed red-blue color scheme ('RdBu_r') is used, which maps low values to blue and high values to red. Also, plt's colorbar() function displays a color-coded bar to the right of the plot that explains what the colors represent. Run the example below to view the heatmap.

In [None]:
# Example of a heatmap:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

a = np.random.random((16, 16))
plt.imshow(a, cmap='RdBu_r', interpolation='nearest')
plt.clim(0,1)
plt.colorbar()
plt.show()

### **K-Means Clustering**
Many datasets contain patterns, where some data values are similar to others. K-means clustering is a machine-learning algorithm that groups data points into k clusters so that the points within a cluster are similar to each other. In this class, you will learn and implement the K-means clustering algorithm to place data into different clusters. To learn more about clustering and how it is used, please watch the following videos.
- Examples of clustering: https://www.coursera.org/lecture/ml-foundations/other-examples-of-clustering-cmh30
- K-means clustering:
    - https://www.youtube.com/watch?v=4b5d3muPQmA
    - https://www.youtube.com/watch?v=5I3Ei69I40s
    
**STEPS IN K-MEANS CLUSTERING** <br>
1.) Select the number of clusters you want. This is “k”. <br>
2.) Randomly select k distinct data points to serve as the initial cluster centers. <br>
3.) Measure the (Euclidean) distance between each point and each cluster center. <br>
4.) Assign each point to the nearest cluster center. <br>
5.) Calculate the mean of the points in each cluster and move that cluster's center to the mean. <br>
6.) Remeasure and re-cluster using the new cluster centers, repeating until the assignments stop changing. <br>
7.) Do it over with different starting points, and keep the run whose clusters are tightest. <br>
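
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation you will write in class: the function names `kmeans`, `total_variation`, and `best_of_restarts`, the six sample points, and the choice to compare restarts by the sum of squared distances to the assigned centers are all hypothetical. It uses `math.dist`, which requires Python 3.8 or later.

```python
import math
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Cluster points into k groups; returns (centers, labels)."""
    rng = random.Random(seed)
    # Step 2: pick k different entries of the data as the initial centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Steps 3-4: assign every point to its nearest center (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Step 6: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 5: move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, labels


def total_variation(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))


def best_of_restarts(points, k, restarts=10):
    """Step 7: rerun with different starting points and keep the tightest result."""
    runs = [kmeans(points, k, seed=s) for s in range(restarts)]
    return min(runs, key=lambda run: total_variation(points, *run))


# Made-up 2D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = best_of_restarts(points, k=2)
print(centers)
print(labels)
```

Which cluster receives label 0 and which receives label 1 depends on the random starting points, so only the grouping of the points is meaningful, not the label numbers themselves.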

![SegmentLocal](images/cs4150-giphy.gif)

## **Conclusion**
In CS 4150, you will be introduced to many important data science concepts and techniques. You will gain experience working with a large dataset containing information about mouse embryonic stem cells. Using this dataset, you will write code to explore its attributes, including pattern detection, similarity and distance metrics, clustering, subgroup identification, and co-segregation. Another key aspect of data science is displaying data in an effective manner. In this class, you will learn how to create and display scatterplots, various graphs, heatmaps, boxplots, and more!