## Predicting superconducting transition temperatures from a materials database

### Background

We will use data on known superconductors to build machine learning models that predict the superconducting transition temperature. We will not try to predict whether a material is a superconductor at all, as that question far exceeds the scope of this class project.

Your project should use as a guide the relevant sections of the paper 
**Machine learning modeling of superconducting critical temperature**
by Stanev et al.

* published: https://www.sciencedirect.com/science/article/pii/S0927025618304877
* arXiv: https://arxiv.org/abs/1709.02727

However, to simplify data collection, we will use the dataset described in the paper
**A Data-Driven Statistical Model for Predicting the Critical Temperature of a Superconductor**
by Kam Hamidieh 

* arXiv: https://arxiv.org/abs/1803.10260
* GitHub: https://github.com/khamidieh/predict_tc/blob/master/paper_3.pdf

The dataset is available in the GitHub project: https://github.com/khamidieh/predict_tc
This dataset is based in part on the same sources as the Stanev paper, but in its final form its choice of variables differs considerably from the data used there. Hence you should not expect identical results.


### Part 1 - Read the papers

Read the papers. You are not expected to understand everything in these papers. Reading a paper for the first time requires you to skip over details and extract the information most important for your purposes.

#### Formulate 3 questions and email them to the instructor. Due date: Thursday, October 23, 5pm.

### Part 2 - Extract the dataset (optional) and read it into python data structures
The Hamidieh dataset can be downloaded as part of a GitHub project provided by the author: https://github.com/khamidieh/predict_tc.
However, the dataset is stored in a binary format that cannot be read directly from Python. Follow the instructions in the GitHub project to extract data files that can be imported into Python.

To simplify this step, the data is also provided in plain-text format in two files (on Canvas for now). What information does each file contain?

### Part 3 - Visualization and exploration of the data

Explore and visualize the dataset. Answer questions like: How many variables? How many entries? Which elements appear how often? Use the figures in the papers as a guide. At a minimum, provide figures of the superconducting transition temperature distribution in the dataset. Create separate plots for the entire dataset and for the 3 classes of superconductors discussed in the Stanev paper:

- $T_c<10K$.
- Iron-based superconductors, i.e. the material composition contains $Fe$.
- HTC superconductors, i.e. the material composition contains $Cu$ and $O$ roughly (but not necessarily exactly) in a $1:2$ ratio, indicating $Cu$-$O$ planes. Note that, for example, $YBa_2Cu_3O_7$ is an HTC superconductor, and $3:7$ counts as roughly $1:2$ in this context. The $Cu$:$O$ ratio can deviate strongly from $1:2$ if oxygen is also present in layers other than the $Cu$-$O$ planes.

In addition:

- For each element, determine the average and standard deviation of the superconducting transition temperatures of the compounds that contain it. Graph your results (2 figures).
- Plot the superconducting transition temperature against the following properties (one figure each): *mean_atomic_mass*, *range_ThermalConductivity*, *range_atomic_radius*.
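The three class definitions in the list above can be turned into boolean masks over the composition table. The following is only a minimal sketch under stated assumptions: it assumes the composition file has one column per element symbol (here `Fe`, `Cu`, `O`) plus a `critical_temp` column, it runs on an invented toy table instead of the real data, and the helper name `class_masks` and the bounds used for "roughly $1:2$" are hypothetical choices that you should tune yourself.

```python
import pandas as pd

# Toy stand-in for the composition table, with invented values.
# Assumed layout: one column per element symbol holding its amount in the
# formula, plus critical_temp (in K) and material.
toy = pd.DataFrame(
    {
        "Fe": [0.0, 1.0, 0.0, 0.0],
        "Cu": [3.0, 0.0, 0.0, 1.0],
        "O": [7.0, 0.0, 0.0, 0.0],
        "critical_temp": [90.0, 8.0, 9.0, 2.0],
        "material": ["toy cuprate", "toy iron-based", "toy low-Tc", "toy Cu, no O"],
    }
)


def class_masks(df, low_tc_cutoff=10.0, o_per_cu_bounds=(1.0, 4.0)):
    """Return one boolean Series per class; a row may belong to several."""
    low_tc = df["critical_temp"] < low_tc_cutoff
    iron_based = df["Fe"] > 0
    # O:Cu ratio, left as NaN where Cu is absent so those rows fail the test.
    o_per_cu = df["O"] / df["Cu"].where(df["Cu"] > 0)
    # The bounds are an assumed reading of "roughly 1:2", not a fixed rule.
    cuprate = (df["O"] > 0) & o_per_cu.between(*o_per_cu_bounds)
    return {"low_tc": low_tc, "iron_based": iron_based, "cuprate": cuprate}


masks = class_masks(toy)
for name, mask in masks.items():
    print(f"{name}: {int(mask.sum())} of {len(toy)} rows")
```

The masks are deliberately not mutually exclusive: an iron-based compound with $T_c<10K$ lands in two classes. A distribution plot for one class could then be drawn from `toy.loc[masks["cuprate"], "critical_temp"]`, for example with the pandas `plot.hist` method.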




### Due date for figures and code to generate the figures is Monday 11/9.

In [2]:
''' 
The data files were already downloaded from Canvas as .csv files; all that is left is to read them.
''' 
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt 
import os 

print(os.getcwd())


/Users/luke/Desktop/Python/GitClone/scientific-computing-archive/sci-comp-II/Week-3


### Part 1 
Sent to Prof. Schneider.

The questions are as follows:

1. Why is the multiple regression model not used for prediction? If it serves only as a benchmark model, what is the benefit of a model that cannot predict future trends or data points?

2. Are there cases where R is preferred over Python for data science? How does Python compare with R when it comes to analyzing data points?

3. Is there a percentage threshold above which results from using ML to analyze the datasets can be considered useful and accurate?


### Part 2

Extract the data and read it into Python data structures.

In [13]:
df_t = pd.read_csv("train.csv") # Files are already located in cwd, no need for full path 
df_um = pd.read_csv("unique_m.csv")
print(df_t)
print(df_um)

       number_of_elements  mean_atomic_mass  wtd_mean_atomic_mass  \
0                       4         88.944468             57.862692   
1                       5         92.729214             58.518416   
2                       4         88.944468             57.885242   
3                       4         88.944468             57.873967   
4                       4         88.944468             57.840143   
...                   ...               ...                   ...   
21258                   4        106.957877             53.095769   
21259                   5         92.266740             49.021367   
21260                   2         99.663190             95.609104   
21261                   2         99.663190             97.095602   
21262                   3         87.468333             86.858500   

       gmean_atomic_mass  wtd_gmean_atomic_mass  entropy_atomic_mass  \
0              66.361592              36.116612             1.181795   
1              73.132787   

In [14]:
print("What information does each file contain?")
print("\ntrain.csv provides the different features and descriptions of the material.") 
print("\nunique_m.csv provides information on the composition of the materials of interest.")

What information does each file contain?

train.csv provides the different features and descriptions of the material.

unique_m.csv provides information on the composition of the materials of interest.
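Printing whole DataFrames truncates most columns, so a more compact way to answer "what does each file contain?" is to summarize shape, column names and column types. The sketch below is illustrative only: it reads two invented miniature CSV strings instead of the real files, `summarize` is a hypothetical helper, and the `critical_temp` and `material` columns and the per-element columns are assumptions about the file layout.

```python
import io

import pandas as pd

# Invented miniature versions of the two files, so the sketch runs on its own.
train_text = """number_of_elements,mean_atomic_mass,critical_temp
2,50.0,12.0
3,75.5,40.0
"""
unique_m_text = """Cu,O,critical_temp,material
0.0,0.0,12.0,toy-A
1.0,2.0,40.0,toy-B
"""

toy_train = pd.read_csv(io.StringIO(train_text))
toy_unique_m = pd.read_csv(io.StringIO(unique_m_text))


def summarize(df, name):
    """Print and return the basic layout of a DataFrame."""
    n_rows, n_cols = df.shape
    non_numeric = df.select_dtypes(exclude="number").columns.tolist()
    print(f"{name}: {n_rows} rows x {n_cols} columns")
    print(f"  first columns: {df.columns[:5].tolist()}")
    print(f"  non-numeric columns: {non_numeric}")
    return {"rows": n_rows, "columns": n_cols, "non_numeric": non_numeric}


train_summary = summarize(toy_train, "train")
unique_m_summary = summarize(toy_unique_m, "unique_m")

# If both files list the same materials in the same order, their target
# columns should agree row by row.
same_targets = toy_train["critical_temp"].equals(toy_unique_m["critical_temp"])
print("critical_temp identical in both files:", same_targets)
```

Calling `summarize(df_t, "train.csv")` and `summarize(df_um, "unique_m.csv")` on the real data would give the number of entries and variables directly, and the row-by-row comparison is a quick check that features and compositions can be combined by position.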


### Part 3
Visualization and exploration of data 
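
One possible starting point for the per-element questions (how often each element appears, and the average and standard deviation of the transition temperature of the compounds containing it) is sketched below. It is not the method of either paper. It assumes the composition table has one column per element plus `critical_temp` and `material`, uses an invented toy table, and `per_element_stats` is a hypothetical helper name.

```python
import pandas as pd

# Toy composition table with invented values (assumed layout, see above).
toy = pd.DataFrame(
    {
        "Cu": [1.0, 2.0, 0.0],
        "O": [2.0, 4.0, 0.0],
        "Fe": [0.0, 0.0, 1.0],
        "critical_temp": [80.0, 90.0, 10.0],
        "material": ["toy-A", "toy-B", "toy-C"],
    }
)


def per_element_stats(df, non_element_cols=("critical_temp", "material")):
    """Count, mean and sample standard deviation of Tc for each element."""
    element_cols = [c for c in df.columns if c not in non_element_cols]
    rows = []
    for element in element_cols:
        # Transition temperatures of all compounds that contain this element.
        tc = df.loc[df[element] > 0, "critical_temp"]
        rows.append(
            {
                "element": element,
                "count": int(tc.size),
                "mean_tc": tc.mean(),
                "std_tc": tc.std(),  # NaN when fewer than two compounds
            }
        )
    return pd.DataFrame(rows).set_index("element")


stats = per_element_stats(toy)
print(stats)
```

The `count` column answers how often each element appears, while `mean_tc` and `std_tc` are the two quantities to graph. Elements that occur in fewer than two compounds have no defined sample standard deviation, so they need to be dropped or marked before plotting.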