## Inferential Statistics
<br/>
At this point, you have obtained the data set for your Capstone project, cleaned and wrangled it into a form that's ready for analysis. It's now time to apply the inferential statistics techniques you have learned to explore the data. For example, are there variables that are particularly significant in terms of explaining the answer to your project question? Are there strong correlations between pairs of independent variables, or between an independent and a dependent variable? 
 
<b> Submission: </b>  
Write a short report (1-2 pages) on the inferential statistics steps you performed and your findings. Check this report into your GitHub repository and submit a link to it. Eventually, this report can be incorporated into your Milestone report.

## Capstone Project 1: IR spectral analysis of organic compounds via machine learning approach

***
<b> Table of contents </b>   
&nbsp;&nbsp; I.   Data preparation  
&nbsp;&nbsp; II.  Difference between two curves  
&nbsp;&nbsp; III.  Molecular weight distribution analysis  
&nbsp;&nbsp; IV. Conclusions

****
### I. Data preparation


<b> 1) Import NIST_chemicals_list_organic.csv that was prepared earlier and check </b>

In [2]:
import pandas as pd

df=pd.read_csv('NIST_chemicals_list_organic.csv',index_col=['CAS']) #set CAS column as index 
df.head()

CAS,Name,Formula,Elements,Mw
100-00-5,"Benzene, 1-chloro-4-nitro-",C6H4ClNO2,"['C', 'H', 'Cl', 'N', 'O']",156.993056
100-01-6,p-Nitroaniline,C6H6N2O2,"['C', 'H', 'N', 'O']",138.042927
100-02-7,"Phenol, 4-nitro-",C6H5NO3,"['C', 'H', 'N', 'O']",139.026943
100043-29-6,2H-Tetrazole,CH2N4,"['C', 'H', 'N']",70.027946
100046-00-2,"2,2,4,4',6,6'-Hexamethylazobenzene N,N'-dioxide",C18H22N2O2,"['C', 'H', 'N', 'O']",298.168128


<b> 2) Randomly extract 50 samples from the list and compare their Mw with the values published on the NIST website to validate </b>

In [3]:
list_random=df.sample(50).index.tolist() #convert the CAS index of the sample into a list
list_random[1:10] #check list by inspecting items 2 through 10 (the slice skips the first item)

['93-98-1',
 '730-46-1',
 '5348-74-3',
 '106-02-5',
 '474083-29-9',
 '106-49-0',
 '13048-17-4',
 '7307-03-1',
 '5240-32-4']

In [57]:
import requests
from bs4 import BeautifulSoup
import re
from decimal import Decimal


In [78]:
Mw_Regex=re.compile(r'\d+\.\d+')
tolerance=0.10 #allowed relative deviation (plus/minus 10%)

for cas in list_random:
    url='http://webbook.nist.gov/cgi/cbook.cgi?ID=%s' %(cas) #go to NIST website to look for the compound via its CAS number
    r=requests.get(url)
    html_doc=r.content
    soup=BeautifulSoup(html_doc,"html.parser")
    
    #the second list item of the main block is read as the molecular weight entry
    Mw=soup.find("main",{"id":"main"}).find("ul").find_all("li")[1].text
    Mw_NIST=float(Mw_Regex.search(Mw).group()) #turn the matched string into a number
    Mw_table=float(df.loc[cas,'Mw']) #calculated Mw stored in the table
    
    #compare with the Mw in the table: report entries that differ by more than the tolerance
    if abs(Mw_NIST-Mw_table)<=tolerance*Mw_table:
        pass
    else:
        print('Mw_NIST:', Mw_NIST, 'is not the same as Mw:', Mw_table)


Among the 50 entries randomly selected from NIST_chemicals_list_organic.csv, no disagreement was found between the calculated molecular weight, Mw, and the value listed on the http://webbook.nist.gov/ website, allowing for a plus/minus 10% error.
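
The tolerance check can also be written as two small helpers and tried offline before any request is sent. This is a minimal sketch: `parse_mw`, `within_tolerance`, and the sample page text are hypothetical and not part of the notebook above.

```python
import re

MW_PATTERN = re.compile(r"\d+\.\d+")  # same pattern as Mw_Regex above

def parse_mw(text):
    """Return the first decimal number found in text, or None if there is none."""
    match = MW_PATTERN.search(text)
    return float(match.group()) if match else None

def within_tolerance(mw_reference, mw_calculated, tolerance=0.10):
    """True when the two values differ by at most `tolerance` (relative to mw_calculated)."""
    return abs(mw_reference - mw_calculated) <= tolerance * mw_calculated

# Hypothetical page text, made up for illustration only
sample_text = "Molecular weight: 157.554"
mw_reference = parse_mw(sample_text)
agrees = within_tolerance(mw_reference, 156.993056)
```

Keeping the comparison in one function makes it easy to tighten the tolerance later without touching the scraping loop.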

<b> 3) Extract chemicals from NIST_chemicals_list_organic.csv that contain specific organic functional groups </b>  
Need to consider a more efficient way of storing each cluster of functional groups than separate DataFrames
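
One possible alternative to separate DataFrames is a single long table with a `Group` column, built from a dictionary of rules. The sketch below is an assumption about how that could look, not the approach used in this notebook; `GROUP_RULES` and `label_groups` are hypothetical names, and only four of the rules are shown.

```python
import pandas as pd

# Hypothetical rule table: group name -> (regex on Name, element required in Formula or None)
GROUP_RULES = {
    "ester": (r"oate$", None),
    "ketone": (r"none$", None),
    "alcohol": (r"ol$", "O"),
    "amine": (r"amine$", "N"),
}

def label_groups(df, rules=GROUP_RULES):
    """Return one row per (compound, functional group) match, tagged in a 'Group' column."""
    pieces = []
    for group, (name_pattern, element) in rules.items():
        mask = df["Name"].str.contains(name_pattern, regex=True)
        if element is not None:
            mask = mask & df["Formula"].str.contains(element, regex=False)
        pieces.append(df[mask].assign(Group=group))
    return pd.concat(pieces)

# Small demo table using rows shown in this notebook
demo = pd.DataFrame(
    {
        "Name": ["Geranyl caproate", "6-Pentadecanone", "3-Hexyn-1-ol", "Benzylamine"],
        "Formula": ["C16H28O2", "C15H30O", "C6H10O", "C7H9N"],
    },
    index=pd.Index(["10032-02-7", "1001-45-2", "1002-28-4", "100-46-9"], name="CAS"),
)
labeled = label_groups(demo)
```

A single group can then be pulled out with `labeled[labeled.Group == "ester"]`, and `labeled.groupby("Group")` iterates over all of them. A compound that matches several rules appears once per group.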

In [82]:
#Need to narrow down the search by limiting the elemental compositions
#(1) search for esters
df_ester=df[df.Name.str.endswith('oate')]
df_ester.head(3)

CAS,Name,Formula,Elements,Mw
100062-67-7,"2-(2,4,5-Trichlorophenoxy)propyl 2,2-dichlorop...",C12H11Cl5O3,"['C', 'H', 'Cl', 'O']",377.915083
10032-02-7,Geranyl caproate,C16H28O2,"['C', 'H', 'O']",252.20893
101166-96-5,"2,2,2-Trichloro-1-cyanoethyl 4-chlorobenzoate",C10H5Cl4NO2,"['C', 'H', 'Cl', 'N', 'O']",310.907439


In [83]:
#(2) search for ketones
df_ketone=df[df.Name.str.endswith('none')]
df_ketone.head(3)

CAS,Name,Formula,Elements,Mw
1001-45-2,6-Pentadecanone,C15H30O,"['C', 'H', 'O']",226.229666
1003-04-9,Dihydro-3-(2H)-thiophenone,C4H6OS,"['C', 'H', 'O', 'S']",102.013936
1003-10-7,Dihydro-2-(3H)-thiophenone,C4H6OS,"['C', 'H', 'O', 'S']",102.013936


In [87]:
#(3) search for alcohols
df_alcohol=df[(df.Name.str.endswith('ol')) & (df.Formula.str.contains('O'))]
df_alcohol.head(3)

CAS,Name,Formula,Elements,Mw
100144-29-4,Cyclopentylethynylmethyl carbinol,C9H14O,"['C', 'H', 'O']",138.104465
1002-28-4,3-Hexyn-1-ol,C6H10O,"['C', 'H', 'O']",98.073165
100-49-2,Cyclohexanemethanol,C7H14O,"['C', 'H', 'O']",114.104465


In [91]:
#(4) search for alkanes
df_alkane=df[(df.Name.str.endswith('ane')) & (df.Formula.str.contains('C')) & (~df.Formula.str.contains('Si'))]
df_alkane.head(3)

CAS,Name,Formula,Elements,Mw
1000-63-1,1-Tert-butoxybutane,C8H18O,"['C', 'H', 'O']",130.135765
100098-21-3,"1,7-Diazabicyclo[5.3.1]undecane",C9H18N2,"['C', 'H', 'N']",154.146999
100098-22-4,"1,8-Diazabicyclo[6.3.1]dodecane",C10H20N2,"['C', 'H', 'N']",168.162649


In [90]:
#(5) search for alkenes
df_alkene=df[(df.Name.str.endswith('ene')) & (df.Formula.str.contains('C')) & (~df.Formula.str.contains('N'))]
df_alkene.head(3)

CAS,Name,Formula,Elements,Mw
1002-26-2,"1,3-heptadiene",C7H12,"['C', 'H']",96.0939
1002-27-3,"1,3,6-Heptatriene",C7H10,"['C', 'H']",94.07825
1002-33-1,"1,3-Octadiene",C8H14,"['C', 'H']",110.10955


In [93]:
#(6) search for amines
df_amine=df[(df.Name.str.endswith('amine')) & (df.Formula.str.contains('N'))]
df_amine.head(3)

CAS,Name,Formula,Elements,Mw
1001-53-2,N-Acetylethylenediamine,C4H10N2O,"['C', 'H', 'N', 'O']",102.079313
1003-03-8,Cyclopentanamine,C5H11N,"['C', 'H', 'N']",85.089149
100-46-9,Benzylamine,C7H9N,"['C', 'H', 'N']",107.073499


In [95]:
#(7) search for aldehydes
df_aldehyde=df[((df.Name.str.endswith('anal')) & (df.Formula.str.contains('O'))) | (df.Name.str.endswith('hyde'))]
df_aldehyde.head(3)

CAS,Name,Formula,Elements,Mw
1003-29-8,1H-Pyrrole-2-carboxaldehyde,C5H5NO,"['C', 'H', 'N', 'O']",95.037114
100-50-5,3-Cyclohexene-1-carboxaldehyde,C7H10O,"['C', 'H', 'O']",110.073165
100-52-7,Benzaldehyde,C7H6O,"['C', 'H', 'O']",106.041865


In [96]:
#(8) search for organic acids
df_acid=df[(df.Name.str.contains('acid')) & (df.Name.str.contains('ic')) & (df.Formula.str.contains('O'))]
df_acid.head(3)

CAS,Name,Formula,Elements,Mw
100-07-2,4-Methoxybenzoic acid chloride,C8H7ClO2,"['C', 'H', 'Cl', 'O']",170.013457
100-09-4,"Benzoic acid, 4-methoxy-",C8H8O3,"['C', 'H', 'O']",152.047344
100096-59-1,"Carbanilic acid, 2-tert-butyl-, ethyl ester",C13H19NO2,"['C', 'H', 'N', 'O']",221.141579


In [97]:
#(9) search for ethers
df_ether=df[(df.Name.str.contains('oxy')) & (df.Name.str.contains('ether')) & (df.Formula.str.contains('O'))]
df_ether.head(3)

CAS,Name,Formula,Elements,Mw
114811-41-5,(2-Hydroxyethylthio)ethyl (vinylthio)ethyl ether,C8H16O2S2,"['C', 'H', 'O', 'S']",208.059172
116373-69-4,2-(2-Cyclohexylphenoxy) ethyl-2-phenoxyethyl e...,C22H28O3,"['C', 'H', 'O']",340.203845
116401-27-5,"Di-[2-(4-tert-butyl-2,6-dichlorophenoxy) ethyl...",C24H30Cl4O3,"['C', 'H', 'Cl', 'O']",506.094906


In [None]:
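#(10) search for halides
#Elements is read from the csv as a string such as "['C', 'H', 'Cl', 'N', 'O']",
#so match the quoted element symbols with a regular expression
halide_list=['F','Cl','Br','I']
halide_pattern="'(?:" + "|".join(halide_list) + ")'"

df_halide=df[df.Elements.str.contains(halide_pattern)]
df_halide.head()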


In [98]:
#(11) search for thiols
df_thiol=df[((df.Name.str.contains('mercap')) | (df.Name.str.contains('thiol'))) & (df.Formula.str.contains('S'))]
df_thiol.head(3)

CAS,Name,Formula,Elements,Mw
100-38-9,"Ethanethiol, 2-(diethylamino)-",C6H15NS,"['C', 'H', 'N', 'S']",133.092521
100-53-8,Benzenemethanethiol,C7H8S,"['C', 'H', 'S']",124.034671
1005-55-6,"2-Propanone, 1-(5-methyl-3H-1,2-dithiol-3-ylid...",C7H8OS2,"['C', 'H', 'O', 'S']",172.001657


In [100]:
#(12) search for thiophenes
df_thiophene=df[(df.Name.str.contains('thiophene')) & (df.Formula.str.contains('S'))]
df_thiophene.head(3)

CAS,Name,Formula,Elements,Mw
100182-47-6,"4,5-Dihydrothiophene, 3-methyl-",C5H8S,"['C', 'H', 'S']",100.034671
1013-23-6,"Dibenzothiophene, 5-oxide",C12H8OS,"['C', 'H', 'O', 'S']",200.029586
10133-30-9,Benzo[b]thiophene-5-carboxaldehyde,C9H6OS,"['C', 'H', 'O', 'S']",162.013936


In [104]:
#(13) search for anilines
df_aniline=df[(df.Name.str.contains('aniline')) & (df.Formula.str.contains('N'))]
df_aniline.head()

CAS,Name,Formula,Elements,Mw
100-01-6,p-Nitroaniline,C6H6N2O2,"['C', 'H', 'N', 'O']",138.042927
1022-07-7,"2,4,6-Trinitro-N-methyl-aniline",C7H6N4O6,"['C', 'H', 'N', 'O']",242.028734
103-70-8,N-Formylaniline,C7H7NO,"['C', 'H', 'N', 'O']",121.052764
10394-64-6,3-Iodo-5-nitroaniline,C6H5IN2O2,"['C', 'H', 'I', 'N', 'O']",263.939574
10403-47-1,2-Bromo-5-nitroaniline,C6H5BrN2O2,"['C', 'H', 'Br', 'N', 'O']",215.95344


In [None]:
#(14) search for benzenes
df_benzene=df[((df.Name.str.contains('phenyl')) | (df.Name.str.contains('benzene'))) & (df.Formula.str.contains('C'))]
df_benzene.head()

In [105]:
#(15) search for pyridines
df_pyridine=df[(df.Name.str.contains('pyridine')) & (df.Formula.str.contains('N'))]
df_pyridine.head()

CAS,Name,Formula,Elements,Mw
1072-98-6,2-Amino-5-chloropyridine,C5H5ClN2,"['C', 'H', 'Cl', 'N']",128.014126
1120-87-2,4-Bromopyridine,C5H4BrN,"['C', 'H', 'Br', 'N']",156.952712
1121-31-9,2-Mercaptopyridine-N-oxide,C5H5NOS,"['C', 'H', 'N', 'O', 'S']",127.009185
1121-55-7,3-ethenylpyridine,C7H7N,"['C', 'H', 'N']",105.057849
1121-76-2,4-Chloropyridine-N-oxide,C5H4ClNO,"['C', 'H', 'Cl', 'N', 'O']",128.998141


<b> 4) Randomly select and import chemicals from NIST database until you obtain 50 jcamp files per functional group </b>

In [123]:
#NEED to inspect and fix

import requests

cas='81432'
#the JCAMP URL takes the CAS digits without hyphens
url="http://webbook.nist.gov/cgi/cbook.cgi?JCAMP=C%s&Index=0&Type=IR" %(cas.replace("-", ""))

response=requests.get(url) #download once and reuse the content below
if len(response.content) >= 1000: #only keep responses of at least 1000 bytes
    # save file according to its name and cas_no
    file_name="%s.jcamp" %(df.loc[cas, 'Name'].replace(",", "").replace("'", "").strip() + "_" + cas)
    with open(file_name, "wb") as f:
        f.write(response.content)
    print(file_name)
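
The "until you obtain 50 files" part of this step could be sketched as a loop over shuffled CAS numbers. The helper names (`nist_ir_url`, `fetch_jcamp`, `collect_spectra`) are hypothetical, and the 1000-byte threshold simply reuses the check from the cell above; whether that threshold reliably separates real spectra from empty responses is an assumption.

```python
import random

def nist_ir_url(cas):
    """Build the JCAMP download URL used above; hyphens are removed from the CAS number."""
    return "http://webbook.nist.gov/cgi/cbook.cgi?JCAMP=C%s&Index=0&Type=IR" % cas.replace("-", "")

def fetch_jcamp(cas):
    """Download the raw response for one CAS number (needs network access)."""
    import requests  # imported here so the rest of the sketch runs without it
    return requests.get(nist_ir_url(cas), timeout=30).content

def collect_spectra(cas_numbers, fetch=fetch_jcamp, target=50, min_bytes=1000, seed=None):
    """Try CAS numbers in random order until `target` usable responses are collected."""
    candidates = list(cas_numbers)
    random.Random(seed).shuffle(candidates)
    spectra = {}
    for cas in candidates:
        if len(spectra) >= target:
            break
        content = fetch(cas)
        if len(content) >= min_bytes:  # same size check as in the cell above
            spectra[cas] = content
    return spectra
```

For one functional group this might be called as `collect_spectra(df_ester.index)`, with the returned bytes written to `.jcamp` files afterwards. Passing `fetch` as an argument lets the loop be tried with a stand-in function before contacting the NIST server.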

<b> 5) Spectral treatment </b>   
- Read the jcamp file.  
- Treat the spectra.  
- Standardize units, etc.  
- Perform feature extraction/classification.
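
The reading and unit steps above might look like the sketch below. It assumes an uncompressed `(X++(Y..Y))` table with whitespace-separated values and a `DELTAX` header, and it assumes the unit labels `MICROMETERS` and `TRANSMITTANCE`; real files may need a fuller JCAMP-DX parser. `read_jcamp`, `standardize_units`, and the demo text are hypothetical.

```python
import numpy as np

def read_jcamp(text):
    """Parse header fields and an uncompressed (X++(Y..Y)) table from JCAMP-DX text."""
    header, x_starts, y_rows = {}, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("##"):
            key, _, value = line[2:].partition("=")
            key = key.strip().upper()
            header[key] = value.strip()
            in_data = (key == "XYDATA")  # data rows follow this label until the next label
        elif in_data and line:
            numbers = [float(token) for token in line.split()]
            x_starts.append(numbers[0])   # first value on a row is X
            y_rows.append(numbers[1:])    # remaining values are consecutive Y points
    x_factor = float(header.get("XFACTOR", 1.0))
    y_factor = float(header.get("YFACTOR", 1.0))
    delta_x = float(header["DELTAX"])
    x = np.concatenate([x0 * x_factor + delta_x * np.arange(len(ys))
                        for x0, ys in zip(x_starts, y_rows)])
    y = np.concatenate([np.asarray(ys) * y_factor for ys in y_rows])
    return header, x, y

def standardize_units(header, x, y):
    """Convert to wavenumber (1/cm) and absorbance where the header says otherwise."""
    if header.get("XUNITS", "").upper() == "MICROMETERS":
        x = 1.0e4 / x  # wavelength in micrometers -> wavenumber in 1/cm
    if header.get("YUNITS", "").upper() == "TRANSMITTANCE":
        y = -np.log10(np.clip(y, 1e-6, None))  # A = -log10(T); clip avoids log of zero
    return x, y

# Made-up file content for illustration only
DEMO = """##TITLE=demo
##XUNITS=1/CM
##YUNITS=TRANSMITTANCE
##XFACTOR=1.0
##YFACTOR=0.001
##DELTAX=2.0
##XYDATA=(X++(Y..Y))
400.0 1000 500 100
406.0 10 1000
##END=
"""
demo_header, demo_x, demo_y = read_jcamp(DEMO)
demo_x, demo_y = standardize_units(demo_header, demo_x, demo_y)
```

After this step every spectrum is on the same axes, so the curves can be interpolated onto a common wavenumber grid before feature extraction.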

### II. Difference between two curves
Calculate the residuals between two curves and determine their distribution:    
- Perform a test for normality and draw Q-Q plots.  
- Calculate the Pearson correlation coefficient between the two curves to test for likeness, and compare it to the Euclidean distance.
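
A minimal sketch of these comparisons, assuming both curves are already sampled on the same x grid. The Shapiro-Wilk test is one possible choice of normality test, not necessarily the one intended here; `compare_curves` and the synthetic curves are hypothetical.

```python
import numpy as np
from scipy import stats

def compare_curves(y1, y2):
    """Summarize how similar two curves sampled on the same grid are."""
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    residuals = y1 - y2
    shapiro_stat, shapiro_p = stats.shapiro(residuals)  # normality test on the residuals
    # Q-Q data: theoretical normal quantiles against the ordered residuals
    (qq_theoretical, qq_ordered), _ = stats.probplot(residuals, dist="norm")
    pearson_r, _ = stats.pearsonr(y1, y2)
    return {
        "shapiro_p": float(shapiro_p),
        "pearson_r": float(pearson_r),
        "euclidean": float(np.linalg.norm(residuals)),
        "qq_theoretical": qq_theoretical,
        "qq_ordered": qq_ordered,
    }

# Synthetic example: one absorption-like peak and a noisy copy of it
x = np.linspace(400, 4000, 200)
curve_a = np.exp(-((x - 1700) / 200) ** 2)
curve_b = curve_a + np.random.default_rng(0).normal(0, 0.01, x.size)
result = compare_curves(curve_a, curve_b)
```

Plotting `qq_ordered` against `qq_theoretical` gives the Q-Q plot. Note that Pearson correlation ignores a constant offset or scale factor between the curves, while the Euclidean distance does not, so the two measures can rank pairs of spectra differently.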

### III. Molecular weight distribution analysis  
- Plot histograms of the molecular weights for all 15 functional groups and inspect them for obvious patterns.
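
To make the histograms comparable, all groups can share the same bin edges. The sketch below only computes the counts; `mw_histogram_counts` is a hypothetical helper, and the demo uses the first three ester and ketone rows shown earlier rather than the full groups.

```python
import numpy as np
import pandas as pd

def mw_histogram_counts(groups, bins=20):
    """Count Mw values per group using bin edges shared by all groups."""
    all_mw = np.concatenate([frame["Mw"].to_numpy() for frame in groups.values()])
    edges = np.histogram_bin_edges(all_mw, bins=bins)
    counts = {name: np.histogram(frame["Mw"], bins=edges)[0]
              for name, frame in groups.items()}
    return pd.DataFrame(counts, index=pd.Index(edges[:-1], name="Mw_bin_start"))

# Demo with the first rows of two groups from the tables above
demo_groups = {
    "ester": pd.DataFrame({"Mw": [377.915083, 252.20893, 310.907439]}),
    "ketone": pd.DataFrame({"Mw": [226.229666, 102.013936, 102.013936]}),
}
counts = mw_histogram_counts(demo_groups, bins=3)
```

With the real data the dictionary would hold `df_ester`, `df_ketone`, and so on, and `counts.plot.bar(subplots=True)` draws one panel per group when matplotlib is installed.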

### IV. Conclusions