# Part 1: Python Coding

The goal of this assignment is to assess the applicant's coding skills in Python. We will be evaluating based on the following criteria, in order of importance: 

1. Correctness of the solution (i.e. does it pass the tests)
2. Readability of the solution (i.e. can we understand the logic behind your solution, is your code readable)
3. Execution speed (i.e. is your solution quite fast or really slow)
4. Efficient usage of Python builtins and libraries (i.e. is your solution "pythonic")

**Task**: Please complete the functions that solve the following problems.

### Problem 1: Palindrome rearranging

Given a string, find out if its characters can be rearranged to form a palindrome.

**Example**

For `inputString = "aabb"`, the output should be
`palindromeRearranging(inputString) = True`.

We can rearrange `"aabb"` to make `"abba"`, which is a palindrome.
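
As an illustration of why `"aabb"` works, a string can be rearranged into a palindrome exactly when at most one of its characters occurs an odd number of times (that character would sit in the middle). The sketch below shows one possible way to check this. The name `can_form_palindrome` is hypothetical and is used here only to avoid overwriting the function you are asked to complete.

```python
from collections import Counter


def can_form_palindrome(text):
    """Return True if the characters of `text` can be reordered into a palindrome."""
    # Count how many distinct characters occur an odd number of times.
    odd_counts = sum(count % 2 for count in Counter(text).values())
    # A palindrome has at most one unpaired character (the middle one).
    return odd_counts <= 1


print(can_form_palindrome("aabb"))  # True
print(can_form_palindrome("acab"))  # False
```

This is only a sketch of the counting idea. Other approaches, such as sorting the string or tracking unpaired characters in a set, are equally valid.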

**Comments**

The function will also be tested on a larger list of test cases. Feel free to add any other test cases to the tests below if you want to check your algorithm against a case that is not in the list.

In [None]:
def palindromeRearranging(inputString):
    return NotImplemented

In [None]:
tests = ["aabb", "acab", "dkfhakc", "odkirteikrtod", "ccddd", "ccdddd", "cd", "eeeeeee", "tacocat", "-a--a"]
solutions = [True, False, False, True, True, True, False, True, True, True]

In [None]:
for t, s in zip(tests, solutions):
    if palindromeRearranging(t) != s:
        print('Error in test %s' % t)
        break
else:
    print('All tests passed.')

### Problem 2: Parsing virtual docking results

Under the directory `./files_to_parse` there are files containing results from a virtual docking experiment. In the experiment, several small molecules were docked with a target (protein) of interest, using software called `Smina`. Within each output file, several results (modes) are provided per molecule, as a table in text format. For example, the file `example.txt` contains:

```
[text we don't care about]

mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
1       -5.4       0.000      0.000
2       -5.4       20.469     21.598
3       -5.4       5.526      7.144
4       -5.2       6.959      7.734
5       -5.2       2.251      2.985
6       -5.1       1.412      1.608
7       -4.9       2.079      2.810
8       -4.9       2.921      3.692
9       -4.8       6.995      8.018
```

To be able to analyze the results, we would like to parse these files and consolidate the results in a single, comma-separated value file with the following columns: `[filename, mode, affinity, rmsd_lb, rmsd_ub]`. For example, the entries for the `example.txt` file would be:

```
filename, mode, affinity, rmsd_lb, rmsd_ub
example.txt, 1, -5.4, 0.000, 0.000
example.txt, 2, -5.4, 20.469, 21.598  
example.txt, 3, -5.4, 5.526, 7.144 
example.txt, 4, -5.2, 6.959, 7.734
example.txt, 5, -5.2, 2.251, 2.985
example.txt, 6, -5.1, 1.412, 1.608
example.txt, 7, -4.9, 2.079, 2.810
example.txt, 8, -4.9, 2.921, 3.692
example.txt, 9, -4.8, 6.995, 8.018
```
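
The table rows shown above can be picked out of the surrounding text with a regular expression. The sketch below assumes that every data row consists of an integer mode followed by three decimal numbers separated by whitespace, as in `example.txt`, and that no line in the ignored text has the same shape. The names `ROW_PATTERN` and `parse_modes` are hypothetical.

```python
import re

# One data row: an integer mode, then three decimal numbers, whitespace-separated.
ROW_PATTERN = re.compile(
    r"^\s*(\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s*$"
)


def parse_modes(text):
    """Return a list of (mode, affinity, rmsd_lb, rmsd_ub) tuples found in `text`."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line)
        if match:  # header, separator, and free-text lines do not match
            mode, affinity, rmsd_lb, rmsd_ub = match.groups()
            rows.append((int(mode), float(affinity), float(rmsd_lb), float(rmsd_ub)))
    return rows


# A shortened copy of the example.txt content shown above.
sample = """\
[text we don't care about]

mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
1       -5.4       0.000      0.000
2       -5.4       20.469     21.598
"""

print(parse_modes(sample))
```

Converting to `float` drops trailing zeros (`0.000` becomes `0.0`). If the CSV must reproduce the original text exactly, keeping the matched strings instead of converting them is one option.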

Your tasks are to:
1. Write a Python function that parses all the files in the directory `files_to_parse` and saves the data as a single CSV file named `parsed_data.csv`.

2. Plot the distribution of the minimum affinities across the different molecules (files). Does it match any standard statistical distribution?

3. In the file `data/important_molecules.txt` there is a list of 20 molecules that are of particular importance in the analysis of the specific target. We have a hypothesis that the affinities of the different modes for each of the important molecules are drawn from the same distribution. Please test this hypothesis.

**Comments**
* Suggested libraries: `NumPy, SciPy, Pandas, Matplotlib, Seaborn`.
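
For the second task, the minimum affinity per file has to be extracted before anything can be plotted. The sketch below shows one way to do that with the standard library only, assuming the CSV has the columns listed above. The function name `min_affinity_per_file`, the file name `other.txt`, and its values are made up for illustration. The suggested libraries (for example a Pandas group-by) would work just as well.

```python
import csv
import io

# Hypothetical in-memory stand-in for parsed_data.csv; "other.txt" is invented.
csv_text = """\
filename, mode, affinity, rmsd_lb, rmsd_ub
example.txt, 1, -5.4, 0.000, 0.000
example.txt, 4, -5.2, 6.959, 7.734
other.txt, 1, -7.1, 0.000, 0.000
other.txt, 2, -6.8, 1.412, 1.608
"""


def min_affinity_per_file(handle):
    """Map each filename to the lowest affinity value among its modes."""
    minima = {}
    # skipinitialspace=True tolerates the ", " separator used in the example.
    for row in csv.DictReader(handle, skipinitialspace=True):
        name = row["filename"]
        affinity = float(row["affinity"])
        if name not in minima or affinity < minima[name]:
            minima[name] = affinity
    return minima


minima = min_affinity_per_file(io.StringIO(csv_text))
print(minima)
```

The values of `minima` are what a histogram or density plot of the minimum affinities would be drawn from.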

In [None]:
# Please solve the problem here in as many cells as needed.

# Part 2: Machine Learning 

The goal of this assignment is to assess the applicant's expertise in solving problems with machine learning using Python. We will be evaluating the solutions by the following criteria (all equally important):

* Data selection/preprocessing and train/test split sanity
* Model evaluation pipeline. We are not looking for a _correct_ model, but rather for a sound evaluation of different models or parameter combinations.
* Explanation of the reasons behind the final model selection.
* Model accuracy on a held-out test set. The test set data are held out by us and not provided, so feel free to use all available data for training/validation.  



### Problem 1: Pokemon Type Classification

Unlike the popular game series' typical quest to catch them all, here your objective is to predict the type of a Pokemon based on its stats.

**Data**: In the file `data/Pokemon.csv`, you can find the relevant data for 634 Pokemon. 
The column to be used as the output is `Type 1`. You can use any combination of the other columns as input.

**Evaluation**: Your model will be evaluated on the multi-class classification problem of predicting the `Type 1` label.  
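
Because the target is a multi-class label, a validation split that ignores class frequencies can leave rare classes out of one side entirely. The sketch below illustrates a stratified hold-out split in plain Python, under the assumption that labels are available as a simple list. The function `stratified_split`, its parameters, and the toy labels are hypothetical and are not taken from `data/Pokemon.csv`. Library implementations of stratified splitting would normally be preferable in a real solution.

```python
import random
from collections import defaultdict


def stratified_split(labels, holdout_fraction=0.2, seed=0):
    """Return (train_indices, holdout_indices), split separately within each class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, holdout = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Hold out at least one sample per class; a class with a single
        # sample therefore ends up only in the hold-out set.
        n_holdout = max(1, round(len(indices) * holdout_fraction))
        holdout.extend(indices[:n_holdout])
        train.extend(indices[n_holdout:])
    return sorted(train), sorted(holdout)


# Toy, imbalanced labels used only for illustration.
labels = ["A"] * 10 + ["B"] * 5 + ["C"] * 5
train_idx, holdout_idx = stratified_split(labels)
print(len(train_idx), len(holdout_idx))
```

Each class contributes roughly the same fraction of its samples to the hold-out set, so per-class metrics can be computed on both sides of the split.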


In [None]:
# Please solve the problem here in as many cells as needed.