# Skill Check 8

The block below imports the necessary packages.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## 1. Data Organization (50 pts)

### 1a: Dow Dataset (5 pts)

Read `impurity_dataset-training.xlsx` into a `pandas.DataFrame` named `df`.

In [2]:
########################################
# Start your code here
df = pd.read_excel('impurity_dataset-training.xlsx')
########################################

In [3]:
assert isinstance(df, pd.DataFrame)
assert df.shape == (10703, 46)
assert np.isclose(np.linalg.norm(df[df.columns[1:]].loc[1]), 3381.2181210675867)

### 1b: Time data (15 pts)

Create a subset of the Dow dataset `df_time` containing only data from Dec. 5 - Dec. 12, 2015. The data should only include columns from `x1` to `x12`.

**Hint:** The problem is much easier if you index by the date.

In [4]:
########################################
# Start your code here
df_time = df.set_index('Date').loc['2015-12-05':'2015-12-12']
df_time = df_time.iloc[:, :12]
########################################

In [5]:
assert df_time.shape == (192, 12), "wrong df_time"
assert np.isclose(np.linalg.norm(df_time.iloc[4]), 2910.9186137459646), "wrong df_time"
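
The date-indexing hint in 1b can be illustrated on a small synthetic frame. This is only a sketch: it assumes hourly timestamps in a `Date` column (consistent with the 192 rows the check expects for eight days), and the columns `a` and `b` are made up rather than taken from the Dow data.

```python
import numpy as np
import pandas as pd

# Synthetic hourly timestamps for Dec. 1-20, 2015 (hypothetical stand-in data).
dates = pd.date_range('2015-12-01', periods=480, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({'Date': dates, 'a': np.arange(480), 'b': np.arange(480) * 2.0})

# With a DatetimeIndex, .loc accepts date strings. A date-only end label
# includes every timestamp on that day, so Dec. 12 is kept in full.
subset = toy.set_index('Date').loc['2015-12-05':'2015-12-12']

print(subset.shape)  # (192, 2): 8 days x 24 hours
```

Without the date index, the same selection would need an explicit boolean mask on the `Date` column, with the end bound chosen carefully so that the last day is not cut off at midnight.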

### 1c: Removing inconsistent null values (20 pts)

Create a dataframe of all features in the Dow dataset called `df_no_express` that does not contain any exclamation marks (`!`) but still includes blank/null values. Each exclamation mark should be replaced with 0.0. `df_no_express` should only include columns `x1` to `x40`, and the column order should not be changed.

In [6]:
########################################
# Start your code here
df_no_express = df[df.columns[1:41]].replace('!', 0.0)
########################################

In [7]:
assert '!' not in df_no_express.values, 'exclamation marks not eliminated'
assert df_no_express.columns[1] == 'x2:Primary Column Tails Flow', 'wrong columns'
assert df_no_express.shape == (10703, 40), 'wrong columns'
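
The requirement in 1c (replace the `!` marks with 0.0 but leave genuine nulls alone) can be checked on a toy frame. The sketch below uses made-up values and column names; the final `astype(float)` is an optional extra step, assuming every remaining entry is numeric or null.

```python
import numpy as np
import pandas as pd

# Toy frame mixing numbers, '!' marks, and a real null (hypothetical values).
toy = pd.DataFrame({'x1': [1.5, '!', np.nan], 'x2': ['!', 2.0, 3.0]})

# replace() only touches cells equal to '!'; NaN cells are left as they are.
cleaned = toy.replace('!', 0.0)

# Columns that held strings may still be object dtype, so convert explicitly.
cleaned = cleaned.astype(float)

n_null = int(cleaned.isna().sum().sum())
print(cleaned)
print(n_null)  # 1: the original NaN survives
```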
### 1d: Mean and variance of features (10 pts)

Save the mean and variance of each feature in `df_no_express` to `mean_no_express` (numpy array) and `var_no_express` (numpy array), respectively.

In [8]:
########################################
# Start your code here
# .mean() and .var() return pandas Series; convert to numpy arrays as requested.
mean_no_express = df_no_express.mean(axis=0).to_numpy()
var_no_express = df_no_express.var(axis=0).to_numpy()
########################################

In [9]:
assert len(mean_no_express) == 40
assert len(var_no_express) == 40
assert np.isclose(np.linalg.norm(mean_no_express) * np.linalg.norm(var_no_express), 1857023957.6543484)
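
Two details of 1d are easy to miss: pandas skips null values when computing column statistics, and `DataFrame.var` defaults to the sample variance (`ddof=1`), whereas `numpy.var` defaults to `ddof=0`. A small sketch on made-up data, not the Dow dataset:

```python
import numpy as np
import pandas as pd

# Toy frame with one null value (hypothetical data).
toy = pd.DataFrame({'x1': [1.0, 2.0, np.nan, 3.0], 'x2': [0.0, 0.0, 4.0, 4.0]})

# Column-wise statistics as numpy arrays; NaN entries are skipped by default.
mean_arr = toy.mean(axis=0).to_numpy()
var_arr = toy.var(axis=0).to_numpy()   # sample variance, ddof=1

# numpy equivalent of the pandas default for comparison.
var_np = np.nanvar(toy.to_numpy(), axis=0, ddof=1)

print(mean_arr)  # [2. 2.]
print(np.allclose(var_arr, var_np))  # True
```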

## 2. APIs (50 pts)

### 2a: Get CID (15 pts)

Write a function `getCID` that takes the name of a compound `name` as an argument and uses the PubChem RESTful API to return its CID. The returned value should be an **integer**.

In [10]:
import requests

In [11]:
def getCID(name):
########################################
# Start your code here
    r = requests.get('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{}/cids/TXT'.format(name))
    return int(r.text)
########################################

In [12]:
assert getCID('methanol') == 887, "test run #1 failed"
assert getCID('ethanol') == 702, "test run #2 failed"
assert getCID('1,5-heptadiene') == 5364394, "test run #3 failed"
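
`getCID` formats the name straight into the URL and assumes the response holds exactly one integer. The sketch below shows two small helpers that relax both assumptions. `build_cid_url` and `parse_first_cid` are hypothetical names, and the sketch assumes that a name matching several compounds comes back as one CID per line in TXT output; taking the first line is a simplification.

```python
from urllib.parse import quote

# Same endpoint as getCID above.
BASE = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{}/cids/TXT'

def build_cid_url(name):
    # Percent-encode characters such as spaces so the name is safe in a URL path.
    return BASE.format(quote(name))

def parse_first_cid(text):
    # Keep non-empty lines only, then convert the first one to an integer.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError('response contains no CID')
    return int(lines[0])

print(build_cid_url('acetic acid'))
print(parse_first_cid('887\n'))  # 887
```

Both helpers are pure functions, so they can be tested without network access; the actual request would still go through `requests.get(build_cid_url(name))`.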

### 2b: Get SMILES (15 pts)

Write a function `getSMILES` that takes the CAS number of a compound (`CAS`, string) as an argument, uses the PubChem RESTful API to retrieve its record as JSON, and extracts the SMILES notation (string) from that record.

In [13]:
import json

In [14]:
def getSMILES(CAS):
########################################
# Start your code here
    r = requests.get('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{}/record/json'.format(CAS))
    chem_json = json.loads(r.text)
    
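    # Assumes the SMILES entry sits at position 18 of 'props';
    # this index depends on the layout of the PubChem record.
    smiles = chem_json['PC_Compounds'][0]['props'][18]['value']['sval']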
    
    return smiles
########################################

In [15]:
assert getSMILES('108-95-2') == 'C1=CC=C(C=C1)O', 'test run #1 failed'
assert getSMILES('64-17-5') == 'CCO', 'test run #2 failed'
assert getSMILES('627-20-3') == 'CCC=CC', 'test run #3 failed'
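
Reading the SMILES string from a fixed position in `props` breaks if the record layout changes. One alternative is to search `props` for the entry whose descriptor matches. The sketch below runs on a hand-built miniature record; the `urn` keys `label` and `name` and the values `'SMILES'` and `'Canonical'` are assumptions about the PubChem record format and should be checked against a live response. `find_prop` is a hypothetical helper name.

```python
# Hand-built miniature record (hypothetical; a real record has many more props).
record = {'PC_Compounds': [{'props': [
    {'urn': {'label': 'Molecular Formula'}, 'value': {'sval': 'C2H6O'}},
    {'urn': {'label': 'SMILES', 'name': 'Canonical'}, 'value': {'sval': 'CCO'}},
    {'urn': {'label': 'SMILES', 'name': 'Isomeric'}, 'value': {'sval': 'CCO'}},
]}]}

def find_prop(record, label, name=None):
    # Return the string value of the first property matching label (and name, if given).
    for prop in record['PC_Compounds'][0]['props']:
        urn = prop['urn']
        if urn.get('label') == label and (name is None or urn.get('name') == name):
            return prop['value']['sval']
    raise KeyError('no property with label {!r} and name {!r}'.format(label, name))

print(find_prop(record, 'SMILES', 'Canonical'))  # CCO
```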

### 2c: Count Bonds (20 pts)

Write a function `countBonds` that takes an arbitrary chemical name or CAS number (`name`, string), uses the PubChem RESTful API, and returns the number of C-H bonds in the compound.

In [16]:
def countBonds(name):
########################################
# Start your code here
    r = requests.get('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{}/record/json'.format(name))
    chem_json = json.loads(r.text)
    
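    # Each bond is described by the atom IDs (aid) of its two endpoints.
    bonds = chem_json['PC_Compounds'][0]['bonds']
    aid_1 = np.array(bonds['aid1']).reshape(-1, 1)
    aid_2 = np.array(bonds['aid2']).reshape(-1, 1)

    # One row per bond: [aid1, aid2].
    bond_pair = np.append(aid_1, aid_2, axis = 1)

    atoms = chem_json['PC_Compounds'][0]['atoms']

    # Assumes atom IDs run 1..N in order, so aid - 1 indexes the list of atomic numbers.
    # A C-H bond joins carbon (6) and hydrogen (1), in either order.
    count = 0
    for pair in bond_pair:
        if atoms['element'][pair[0] - 1] == 6 and atoms['element'][pair[1] - 1] == 1:
            count += 1
        elif atoms['element'][pair[0] - 1] == 1 and atoms['element'][pair[1] - 1] == 6:
            count += 1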
    
    return count
########################################

In [17]:
assert countBonds('ethanol') == 5, "test run #1 failed"
assert countBonds('water') == 0, "test run #2 failed"
assert countBonds('108-95-2') == 5, "test run #3 failed"
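
The counting logic in `countBonds` can be exercised offline on a hand-built record. The sketch below mirrors the `atoms` and `bonds` keys used above, but maps atom IDs to elements with a dictionary instead of relying on `aid - 1` indexing, and tolerates a missing `bonds` entry. `count_ch_bonds` is a hypothetical helper and the methanol-like record is written by hand, not downloaded from PubChem.

```python
# Hand-built methanol-like compound: O(1), C(2), H(3-5) on carbon, H(6) on oxygen.
methanol = {
    'atoms': {'aid': [1, 2, 3, 4, 5, 6], 'element': [8, 6, 1, 1, 1, 1]},
    'bonds': {'aid1': [1, 1, 2, 2, 2], 'aid2': [2, 6, 3, 4, 5]},
}

def count_ch_bonds(compound):
    atoms = compound['atoms']
    # Map each atom ID to its atomic number.
    element_of = dict(zip(atoms['aid'], atoms['element']))
    # A compound without bonds (e.g. a single atom) may lack the 'bonds' key.
    bonds = compound.get('bonds', {})
    pairs = zip(bonds.get('aid1', []), bonds.get('aid2', []))
    # A C-H bond has exactly the atomic numbers {1, 6} at its two ends.
    return sum(1 for a, b in pairs if {element_of[a], element_of[b]} == {1, 6})

print(count_ch_bonds(methanol))  # 3
```

Like `countBonds`, this only works when the record lists hydrogens explicitly as atoms with their own bonds.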