# BIOS470/570 Lecture 7

## Last time we covered:
* ### Introduction to gene expression measurements
* ### pandas data frames, indexing with .loc and .iloc

## Today we will cover:
* ### Missing data, duplicated data, and string operations
* ### Merging multiple data sets with pandas

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
human_data = pd.read_excel('data/GSE137492_SupplementaryTable1.xlsx')
frog_data = pd.read_csv('data/xen_uic_hik_stage8_13_30min.tsv',delimiter='\t')

## More on dealing with missing data:

### The methods .isna and .notna find the missing and non-missing entries:

In [None]:
human_data.isna()

In [None]:
human_data.notna()

In [None]:
# keep the rows that have at least one non-missing value
human_data.loc[human_data.notna().any(axis = 1)]

In [None]:
human_data.fillna("No Name")
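
The same calls can be tried on a small table where the answer is easy to check by eye. This is only a sketch: the table `toy` and its values are made up for illustration and are not part of the lecture data. It also shows the difference between `.any(axis=1)` (at least one value present in the row) and `.all(axis=1)` (every value present), and passes a dictionary to `fillna` so that only the chosen column is filled.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; names and numbers are invented for illustration.
toy = pd.DataFrame({
    "genes": ["SOX2", None, "BMP4"],
    "sample_1": [10.0, np.nan, np.nan],
})

missing_mask = toy.isna()                       # True where a value is missing
rows_with_any_value = toy.notna().any(axis=1)   # at least one entry present in the row
rows_complete = toy.notna().all(axis=1)         # every entry present in the row

# A dict fills only the named column; fillna returns a new frame and leaves toy unchanged.
filled = toy.fillna({"genes": "No Name"})
```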

### Many commands take an inplace argument. With inplace=True the command modifies the variable itself instead of returning a modified copy:

In [None]:
human_data.dropna(inplace=True)

In [None]:
human_data
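
A minimal sketch of the two calling styles, using a made-up one-column table called `toy` (not the lecture data). Without `inplace`, the result has to be assigned to a name to be kept; with `inplace=True`, the original variable changes and the return value should not be relied on.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0]})

returned = toy.dropna()          # new frame with 2 rows; toy is untouched
rows_before = len(toy)           # still 3 at this point

toy.dropna(inplace=True)         # modifies toy itself; do not rely on the return value
rows_after = len(toy)            # now 2
```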

In [None]:
human_data.index = human_data.loc[:,"genes"]
human_data.drop("genes",axis=1,inplace=True)

In [None]:
human_data

### Pandas has commands for dealing with duplicated data. Both the dataframe and its index can be checked for duplicates:

In [None]:
human_data.duplicated()

In [None]:
human_data.duplicated().any()

In [None]:
human_data.loc[human_data.index.duplicated()]

### Notice that the first instance of each duplicated gene is missing from this output. By default, duplicated marks every instance of a duplicate except the first. What if we wanted to see them all? The following shows all except the last one:

In [None]:
human_data[human_data.index.duplicated(keep = "last")]

### And this shows them all:

In [None]:
human_data[human_data.index.duplicated(keep = False)]
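
The three settings of `keep` are easiest to compare on a tiny index with one repeated label. The table `toy` below is invented for illustration. The last line is one possible way to drop the repeats and keep only the first row for each label.

```python
import pandas as pd

# Hypothetical table whose index repeats the label "A" three times.
toy = pd.DataFrame({"value": [1, 2, 3, 4]}, index=["A", "B", "A", "A"])

first_kept = toy.index.duplicated()              # default keep="first": first "A" is not marked
last_kept = toy.index.duplicated(keep="last")    # last "A" is not marked
all_marked = toy.index.duplicated(keep=False)    # every "A" is marked

# Keep only the first row for each index label.
deduplicated = toy[~toy.index.duplicated()]
```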

### Are there duplicated Ensembl IDs?

In [None]:
human_data.duplicated(subset="geneIds").any()

In [None]:
human_data.index.duplicated().any()

### Pandas has a number of built-in string methods you can use for filtering and manipulating string data inside pandas objects:

In [None]:
human_data.index.str.contains("BMP")

In [None]:
human_data.loc[human_data.index.str.contains("BMP")]

In [None]:
human_data.index.str.lower()

### The .match method is for matching with regular expressions. This pattern is BMP, followed by one or two digits from 1 to 9, followed by the end of the string:

In [None]:
human_data.loc[human_data.index.str.match("BMP[1-9]{1,2}$")]

In [None]:
human_data.loc[human_data.index.str.match("SOX[1-9]{1,2}$")] #"SOX" followed by one or two digits from 1 to 9

In [None]:
human_data.loc[human_data.index.str.match("SOX[1-9]{2}$")] #needs two digits
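
To see exactly what these patterns accept, it can help to run them on a short list of names. The names below were chosen for illustration and are not taken from the data set. Two details are worth noticing: `.str.match` only matches from the start of the string, while `.str.contains` searches anywhere, and the character class `[1-9]` excludes 0, so a name such as BMP10 is not matched unless the class is widened to `[0-9]`.

```python
import pandas as pd

# Illustrative names only; not taken from the data set.
names = pd.Index(["BMP4", "BMP10", "BMPR1A", "SOX2", "SOX17", "XSOX2"])

contains_bmp = names.str.contains("BMP")             # "BMP" anywhere in the name
bmp_1_to_9 = names.str.match("BMP[1-9]{1,2}$")       # one or two digits, each from 1 to 9
bmp_0_to_9 = names.str.match("BMP[0-9]{1,2}$")       # widened class also accepts the 0 in BMP10
sox_two_digits = names.str.match("SOX[1-9]{2}$")     # exactly two digits, match starts at position 0
```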

### Let's look at our frog data again and then manipulate the index so that it is similar to the human one:

In [None]:
frog_data

In [None]:
frog_genes = frog_data.loc[:,"Gene"]

In [None]:
frog_genes

In [None]:
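# Build an array of strings to hold one gene name per row.
frog_gene_names = np.zeros(len(frog_genes)).astype(str)
for ii in range(len(frog_genes)):
    if frog_genes[ii].count('|') > 0:
        # entries containing '|' are split, and the second field is used as the gene name
        gn = frog_genes[ii].split('|')
        frog_gene_names[ii] = gn[1]
    else:
        # the array holds strings, so np.nan is stored as the text "nan"
        frog_gene_names[ii] = np.nan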

In [None]:
frog_data.index = frog_gene_names
frog_data

In [None]:
# rows without a gene name were labeled with the string "nan"; drop them
frog_data = frog_data.loc[frog_data.index != "nan"]
frog_data

In [None]:
frog_data.index = frog_data.index.str.upper()
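
The same parsing can also be sketched without an explicit loop by chaining the pandas string methods. This is an alternative, not the approach used above, and the example labels are hypothetical: the only assumption is that entries containing `|` carry the gene name in the second field, as in the loop.

```python
import pandas as pd

# Hypothetical labels; the real "Gene" column may look different.
genes = pd.Series(["gene1|sox2|extra", "gene2", "gene3|bmp4"])

# Split on the literal '|' and take the second field.
# Entries with no second field come out as missing values.
names = genes.str.split("|").str[1]

# Drop the entries without a name and convert the rest to upper case.
kept = names.dropna().str.upper()
```

Because the missing entries are real missing values here rather than the string "nan", `dropna` can be used to remove them.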

### Now we have frog data and human data with compatible indexes 

### Now let's see some methods for combining the human and frog data

### Merge is a very general command that combines two datasets. Here we use the indexes to join them. You could also specify columns to join on with the left_on and right_on arguments:

In [None]:
pd.merge(human_data,frog_data,left_index=True, right_index=True) 
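
As a sketch of the column-based form, the two small tables below keep the gene names in ordinary columns with different names. All table contents and column names here are invented for illustration. By default `pd.merge` keeps only the keys found in both tables (`how="inner"`); `how="outer"` keeps every key and fills the gaps with missing values.

```python
import pandas as pd

# Hypothetical tables; column names and values are made up.
human = pd.DataFrame({"gene": ["SOX2", "BMP4", "NANOG"], "human_expr": [5.0, 2.0, 9.0]})
frog = pd.DataFrame({"Gene": ["SOX2", "BMP4", "VENTX1"], "frog_expr": [1.0, 7.0, 3.0]})

# Join on a column from each table instead of on the indexes.
inner = pd.merge(human, frog, left_on="gene", right_on="Gene")               # keys in both tables
outer = pd.merge(human, frog, left_on="gene", right_on="Gene", how="outer")  # keys in either table
```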

### Join uses the left index to decide which rows to include, so this has all the human data and the corresponding frog data:

In [None]:
human_data.join(frog_data) 

### This has all frog data with corresponding human data:

In [None]:
frog_data.join(human_data)

### In either case, we can use dropna to get the intersection:

In [None]:
merged = frog_data.join(human_data).dropna()
merged 
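
A small sketch comparing the two routes to the intersection, with made-up tables. `join` also accepts a `how` argument, and `how="inner"` returns the shared rows directly. One difference to keep in mind: `dropna` removes any row with a missing value, including rows whose values were already missing before the join, while `how="inner"` only looks at the index labels.

```python
import pandas as pd

# Hypothetical tables indexed by gene name; values are invented.
human = pd.DataFrame({"human_expr": [5.0, 2.0, 9.0]}, index=["SOX2", "BMP4", "NANOG"])
frog = pd.DataFrame({"frog_expr": [1.0, 7.0, 3.0]}, index=["SOX2", "BMP4", "VENTX1"])

left = human.join(frog)                     # default how="left": all human rows
via_dropna = left.dropna()                  # drop rows where the frog value is missing
via_inner = human.join(frog, how="inner")   # only index labels present in both tables
```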

In [None]:
all_data = merged.drop(["Gene","geneIds"],axis = 1)

### The pcolor function is useful for making heatmaps of this data. It is helpful to sort or otherwise organize the data to see the trends. Taking logarithms also helps you see the full range of the data. Notice the +1, which avoids taking the logarithm of zero (i.e. if the data is x, we look at log2(x+1)):

In [None]:
fig = plt.figure(figsize = (12,6))
ax = fig.add_subplot(1,2,1)
ax.pcolor(np.log2(all_data.to_numpy()+1))
ax.set_title("unsorted")
ax = fig.add_subplot(1,2,2)
ax.pcolor(np.log2(all_data.sort_values("UIC_1").to_numpy()+1))
ax.set_title("sorted")
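
The effect of the +1 can be checked on a few numbers. The values in `counts` are made up for illustration. A zero maps to 0 after the shift, whereas the plain logarithm of zero is negative infinity, which cannot be shown on a color scale.

```python
import numpy as np

# Made-up expression values, including a zero.
counts = np.array([0.0, 1.0, 3.0, 1023.0])

shifted = np.log2(counts + 1)        # zero maps to 0, large values are compressed

# Without the shift, log2(0) is -inf; the warning is silenced only for this demonstration.
with np.errstate(divide="ignore"):
    unshifted = np.log2(counts)
```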