## **Course Description**
CS 4170 - Data Mining with Applications in the Life Sciences is a computer science elective at Ohio University that undergraduate students often take in their third or fourth year. If you have ever been interested in mining data or using data to solve complex problems, CS 4170 is a great course for learning more. Students learn the Perl programming language and use it to design and develop custom software that solves real-world life science problems. Topics include processing DNA and protein sequences, restriction maps, data pipelines, and the Entrez Programming Utilities. Throughout the course, students learn data mining techniques that help solve modern problems in the life sciences.

## **Learning Outcomes**
- Students will gain the ability to develop Perl programs that combine third-party tools to form customized data analysis pipelines
- Students will gain the ability to use the Perl programming language to architect and construct software packages that solve computational biology problems
- Students will gain the ability to develop Perl programs that perform processing of biological sequence data
- Students will learn basic concepts of database management
- Students will learn about features of the BioPerl libraries

## **What You'll Learn**

### **MapReduce**
There are many data mining techniques, but one of the most widely used by individuals and large companies is MapReduce. MapReduce is a programming model for processing large data sets in parallel. The basic MapReduce algorithm divides the data set into chunks that are processed on different hardware, then gathers the results from each process once it finishes and combines them into a final result.
> - **A MapReduce program consists of three separate steps:**
>> 1. **The mapping step**
>>> - This is where the data gets filtered and sorted. The results of this step are a collection of (key, value) pairs which represent the mapping of the data we are mining.
>> 2. **The shuffle step**
>>> - The shuffle step acts as an intermediate stage between the map and reduce steps. Its one job is to sort the (key, value) pairs so that all pairs with the same key reach the reducer together.
>> 3. **The reducing step**
>>> - The reduce step is where we perform a specific summary operation. In the word-count program below, the reduce step adds up the values for each key and then outputs each unique word with its total count.
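
The three steps above can be sketched on a tiny in-memory data set. This is only an illustration under stated assumptions: the function names `map_chunk`, `shuffle`, and `reduce_group` and the sample `chunks` are made up for this example, and the chunks are processed one after another here, whereas a real MapReduce system would map each chunk on different hardware.

```python
from collections import defaultdict

def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of lines."""
    return [(word, 1) for line in chunk for word in line.split()]

def shuffle(pairs):
    """Shuffle step: gather the values of identical keys together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_group(key, values):
    """Reduce step: summarize all values of one key, here by summing them."""
    return key, sum(values)

# Hypothetical data set, already split into two chunks
chunks = [
    ["the cat sat", "the dog sat"],
    ["the cat ran"],
]

# Each chunk is mapped independently of the others
mapped = [pair for chunk in chunks for pair in map_chunk(chunk)]
grouped = shuffle(mapped)
word_counts = dict(reduce_group(key, values) for key, values in grouped.items())
print(word_counts)
# → {'the': 3, 'cat': 2, 'sat': 2, 'dog': 1, 'ran': 1}
```

Because the map step only looks at one chunk at a time, the chunks can be handed to separate workers without changing the final counts.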

A common use of MapReduce is counting the number of occurrences of each unique word in a file. Below is a Python program that parses a .txt file and outputs each unique word along with the number of times it occurred in the file.

In [None]:
#Adapted from https://towardsdatascience.com/5-data-mining-techniques-every-data-scientist-should-know-be06426a4ed9
#Define mapper: emit one "word\t1" record per word in the file
def mapper(file):
    #Create the list once, before the loop, so records from every line are kept
    count = []
    for line in file:
        #Get words
        words = line.strip().split()
        for w in words:
            count.append('%s\t%s' % (w, 1))
    return count

#Define shuffle: sort the records so identical words sit next to each other
def shuffle(counts):
    return sorted(counts)

#Define reducer: add up the counts of each run of identical words
def reducer(counts):
    current_word = None
    current_count = 0
    for line in counts:
        line = line.strip()
        #The input we got from the mapper
        word, count = line.split('\t', 1)
        try:
            count = int(count)
        except ValueError:
            #Skip records whose count is not a number
            continue
        #Same word as the previous record: keep adding to its count
        if current_word == word:
            current_count += count
        else:
            #New word: output the finished word, then start a new count
            if current_word is not None:
                print('%s\t%s' % (current_word, current_count))
            current_count = count
            current_word = word
    #Output the last word, which the loop above never prints
    if current_word is not None:
        print('%s\t%s' % (current_word, current_count))

#Use the functions on any txt file
with open("src/myfile.txt", "r") as file1:
    mapped_results = mapper(file1)
#Shuffle, then pass to the reducer
reducer(shuffle(mapped_results))

### **Frequent Itemset Analysis**
Frequent itemset analysis is another popular data mining technique used in industry. It analyzes data using the market-basket model, which describes a common form of many-to-many relationship between two kinds of objects: items and baskets. Each basket holds a set of items (hence, itemset), and it is usually assumed that the number of items in a basket is much smaller than the total number of items.
> - Frequent itemset analysis can be applied to many kinds of problems. For example, suppose we have some text documents that we want to mine. Two possible applications are:
>> - **Related Concepts:** Let items be words, and let baskets be documents. If we look for sets of words that appear together in many documents, the sets will be dominated by the most common words, such as stop words or connecting words. We can ignore these words to see the most frequent meaningful words in the documents.
>> - **Plagiarism:** In this case, the items are the documents and the baskets are the sentences. An item is part of a basket if the sentence appears in the document. To detect plagiarism, we look for pairs of items that appear together in several baskets. If we find such a pair, then we have two documents that share several sentences, which suggests plagiarism.
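
The plagiarism case can be sketched in plain Python. This is a minimal illustration, assuming each document has already been split into a set of sentences; the `documents` data, the variable names, and the `min_shared` threshold of 2 are all hypothetical choices for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical documents, each reduced to its set of sentences
documents = {
    "doc_a": {"data mining is fun", "perl is a language", "dna is a sequence"},
    "doc_b": {"data mining is fun", "perl is a language", "proteins fold"},
    "doc_c": {"proteins fold", "cells divide"},
}

# Build the baskets: one basket per sentence, holding the documents (items) that contain it
baskets = {}
for doc, sentences in documents.items():
    for sentence in sentences:
        baskets.setdefault(sentence, set()).add(doc)

# Count how many baskets each pair of documents appears in together
pair_support = Counter()
for docs in baskets.values():
    for pair in combinations(sorted(docs), 2):
        pair_support[pair] += 1

# Assumed threshold: a pair is frequent if the two documents share at least 2 sentences
min_shared = 2
suspects = {pair: n for pair, n in pair_support.items() if n >= min_shared}
print(suspects)
# → {('doc_a', 'doc_b'): 2}
```

Here `doc_b` and `doc_c` share only one sentence, so that pair stays below the threshold and is not reported.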

In [None]:
#Import needed libraries
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
#Create or import a dataset
dataset = [['Veggies', 'Onion', 'Nutmeg', 'Black Beans', 'Veggies', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Black Beans', 'Veggies', 'Yogurt'],
           ['Veggies', 'Apple', 'Black Beans', 'Eggs'],
           ['Eggs', 'Unicorn', 'Corn', 'Black Beans', 'Yogurt'],
           ['Corn', 'Garlic', 'Garlic', 'Black Beans', 'Ice cream', 'Veggies']]
#Transform into the correct format (one boolean column per item)
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
#Apply the Apriori algorithm with 70% minimum support
apriori(df, min_support=0.7, use_colnames=True)

## **Conclusion**
CS 4170 is one of the most interesting computer science electives offered at Ohio University. If you have ever been interested in working with large data sets, using the Perl programming language, or solving problems in the life sciences, you should seriously consider taking this course. Skilled Perl programmers are rare in the job market these days, so being able to show off your Perl skills and the software projects you developed in this course could intrigue future employers. One of the best parts of this course is that students learn how to contribute to the life sciences by using software to solve modern-day problems.