# Krittika Convener Selection
## Python Assignment

1. You may find the KSP tutorials useful: https://github.com/krittikaiitb/tutorials - Tutorials 1, 2, 3, and 4 are particularly relevant. They cover basic Python, NumPy, functions in Python, and Matplotlib, respectively.

2. A helpful reminder: executing a cell with `help(np.loadtxt)` or `np.loadtxt?` will show the documentation for that function.

3. Use of the internet is completely ALLOWED for solving this assignment.

4. Feel free to use multiple cells for your solutions; this makes your code easier to follow step by step. But keep the cells for each question together (don't use a cell to solve Q1 after Q2).

5. Try to keep your code neat and make use of comments and/or markdown cells to explain what you have done.

In [2]:
import csv
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
from math import e
from datetime import date, timedelta

### Q1: Parsing Form Responses
Suppose that you are a convener and it has been a few months into your tenure. We have organized an event aimed at the newly joined freshers (your juniors).

We floated a Google form to collect their details and get them registered. We planned to contact them all via WebMail to send them details of the event.

While going through the responses, you discover that your co-convener forgot to filter inputs in the form! There appear to be many invalid roll numbers - we cannot contact these students via WebMail. Here are the first few entries:

| Sr. No. | Name | Roll Number | Contact Number |
|---------|------|-------------|----------------|
| 1       | MV   | 220070044   | 986937546      |
| 2       | DV   | 22b280013   | 961101307      |
| 3       | RR   | 21070042    | 908204532      |
| 4       | YB   | 220030019   | 947226579      |

As you can see, the 3rd entry already contains an erroneous LDAP (roll number): it has only 8 characters instead of 9.

Your task here is to find the submissions with invalid roll numbers and filter them out. We will reach out to these people using their contact numbers, so your final output should be their names and contact numbers.

*PS : As you might suspect, this data is sourced from an actual event from our tenure. It has been anonymized and the errors have been exaggerated :)*

In [None]:
file1 = 'Dataset_Q1.csv' #this is the CSV file that contains all the responses

Feel free to use any libraries/standard functions that you might need to solve this problem.

In [None]:
with open("Dataset_Q1.csv", "r", newline="") as f:    #the csv module recommends newline="" so that the reader handles line endings itself
    r = csv.reader(f)
    for row in r:
        if len(row) < 4:    #skipping blank or incomplete rows
            continue
        if row[1] == 'Name':    #excluding the heading line
            continue
        roll = row[2].strip()    #strip once so that every check sees the same string
        #invalid if the roll number is not 9 characters long, does not start with
        #a two-digit year, or has an admission year later than 2022
        if len(roll) != 9 or not roll[:2].isdecimal() or int(roll[:2]) > 22:
            print("Name: ", row[1], "  Contact: ", row[3])
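The same check can also be packaged as a small reusable function. This is only a sketch: the name `is_valid_roll`, the regular expression, and the `latest_year` parameter are illustrative assumptions (two digits for the admission year followed by seven alphanumeric characters), not an official description of the roll number format. It is run here on the four sample entries from the table above.

```python
import re

# Assumed format: two-digit admission year + seven alphanumeric characters (9 in total).
# This pattern is an illustration, not an official rule.
ROLL_PATTERN = re.compile(r"\d{2}[0-9A-Za-z]{7}")

def is_valid_roll(roll, latest_year=22):
    """Return True if `roll` looks valid under the assumptions stated above."""
    roll = roll.strip()
    if ROLL_PATTERN.fullmatch(roll) is None:
        return False
    return int(roll[:2]) <= latest_year

# The four sample entries from the table: (name, roll number, contact number)
sample = [
    ("MV", "220070044", "986937546"),
    ("DV", "22b280013", "961101307"),
    ("RR", "21070042", "908204532"),
    ("YB", "220030019", "947226579"),
]

# Names and contact numbers of people whose roll number failed the check
invalid = [(name, contact) for name, roll, contact in sample if not is_valid_roll(roll)]
print(invalid)
```

Keeping the rule in one function means the bonus part can reuse it instead of repeating the same conditions.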


#### Bonus part:
Amongst the valid entries, what proportion are actually freshers? Remember that we intended to target them with this event. The majority seem to be freshers but you will also find some second and third year students. You can identify each of these groups by the first two digits of their roll numbers. 

Your task is to graphically depict the number of applicants across the three batches.

In [None]:
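numbers = [0, 0, 0]    #counts for admission years 22, 21 and 20
with open("Dataset_Q1.csv", "r", newline="") as f:
    r = csv.reader(f)
    for row in r:
        if len(row) < 4:    #skipping blank or incomplete rows
            continue
        roll = row[2].strip()
        if len(roll) != 9 or not roll[:2].isdecimal():    #excluding the invalid entries (and the heading row)
            continue
        year = int(roll[:2])
        if 20 <= year <= 22:    #only the three batches being plotted; avoids indexing outside the list
            numbers[22 - year] += 1    #freshies (22) go to numbers[0], sophies (21) to numbers[1], thirdies (20) to numbers[2]

tags = ['Freshies', 'Sophies', 'Thirdies']
plt.bar(tags, numbers, width=0.5)
plt.ylabel("Number registered")
for i, v in enumerate(numbers):
    plt.text(i, v + 0.1, str(v), ha='center')    #writing the count above each bar

plt.show()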
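The question also asks for the proportion of freshers, which a bar chart of counts does not state directly. A minimal sketch of that calculation is given here, using `collections.Counter`. The list `valid_rolls` is a made-up stand-in: its first three entries are the valid rows of the sample table and the last two are hypothetical, so the numbers it prints say nothing about the real dataset.

```python
from collections import Counter

# Stand-in data: three valid rolls from the sample table plus two hypothetical ones
valid_rolls = ["220070044", "22b280013", "220030019", "210050012", "20d070033"]

# Batch name for each admission year (first two digits of the roll number)
labels = {"22": "Freshies", "21": "Sophies", "20": "Thirdies"}

counts = Counter(labels[r[:2]] for r in valid_rolls if r[:2] in labels)
total = sum(counts.values())

# Fraction of valid applicants in each batch
proportions = {name: counts[name] / total for name in labels.values()}
print(proportions)
```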


### Q2: A New Discovery
During one of our regular stargazing sessions, you and your co-conveners discover a new blip of light that shouldn't be there. After examining it a bit, you realise that this object is not quite like anything the world has seen before. You share your data with club seniors and make a startling find: it's the first of a completely new class of objects. An ex-secy of the club, Siddhant Tripathy, analyses it extensively and declares that it's actually the first ever **endoplanet** to be found. You and your team are now international celebrities, but it's time to organise an event so that people from insti can see this.

Your task is to find out when exactly Tripps' endoplanet would be at its brightest and organise a stargazing session on that date so that everyone can see it for themselves. You have data from a month of observations of this object and you need to extrapolate it to find the peak.

In [None]:
file2 = 'Dataset_Q2.csv'

According to your analysis, this object is in a special orbit that gives it a roughly Gaussian light curve, i.e. the plot of [magnitude](https://en.wikipedia.org/wiki/Apparent_magnitude) vs. time roughly follows an inverted Gaussian function. Recall that a generic Gaussian function normalised to unit area is given by

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left({-\frac{1}{2}{\left(\frac{x-\mu}{\sigma}\right)}^2}\right)$$ 

where $\mu$ is the mean of the distribution it describes and $\sigma$ is the standard deviation. More about it [here](https://archive.lib.msu.edu/crcmath/math/math/g/g087.htm).
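Two properties of this function are easy to check numerically: it integrates to one, and its maximum sits at $x = \mu$ with height $1/(\sigma\sqrt{2\pi})$. The sketch below does so with NumPy. The values of `mu` and `sigma` are arbitrary illustrative choices and have nothing to do with the dataset.

```python
import numpy as np

def gaussian(x, mu, sigma):
    """Gaussian with mean mu and standard deviation sigma, as in the formula above."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

mu, sigma = 15.0, 4.0  # arbitrary illustrative values
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 20001)
y = gaussian(x, mu, sigma)

dx = x[1] - x[0]
area = np.sum(y) * dx        # simple Riemann sum over a wide interval
peak_x = x[np.argmax(y)]     # location of the maximum
peak_y = y.max()             # height of the maximum

print(area, peak_x, peak_y, 1 / (sigma * np.sqrt(2 * np.pi)))
```

For the light curve, the prefactor is replaced by a free amplitude, since the depth of the dip in magnitude is not tied to the width.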

Your task is to find the date on which the object will be at its brightest, along with how bright it's expected to be. Also plot the original data along with the fitted curve.

You can do this by fitting a Gaussian to the light curve data and locating its extremum. You may find `scipy.optimize.curve_fit` useful.

**Important**: Our fit function must be a Gaussian with a vertical offset. The problem is that `curve_fit` tends to misbehave in this particular example when you ask it to fit that offset as well, so assume it to be $9.0$ to solve this problem. This, of course, implies that the baseline magnitude of the object is $9.0$.
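A minimal sketch of such a fit is shown here on synthetic data, with the offset hard-coded rather than fitted. Everything in it is an assumption made for illustration: the function name `light_curve`, the "true" parameters, the noise level, and the initial guess `p0` are invented, and the data are generated rather than read from `Dataset_Q2.csv`. The peak is deliberately placed after the last observation, so the fit has to extrapolate.

```python
import numpy as np
from scipy.optimize import curve_fit

BASELINE = 9.0  # fixed vertical offset, as stated in the note above

def light_curve(t, mu, sigma, amp):
    """Inverted Gaussian: a dip of depth `amp` below the fixed baseline magnitude."""
    return BASELINE - amp * np.exp(-0.5 * ((t - mu) / sigma) ** 2)

# Synthetic stand-in for 30 days of observations (NOT the assignment data)
rng = np.random.default_rng(0)
t = np.arange(1, 31, dtype=float)
true_mu, true_sigma, true_amp = 33.0, 7.0, 3.0   # invented values
mags = light_curve(t, true_mu, true_sigma, true_amp) + rng.normal(0, 0.005, t.size)

# A rough initial guess helps curve_fit; its default guess is 1 for every parameter
p0 = [31.0, 6.0, 2.0]
popt, pcov = curve_fit(light_curve, t, mags, p0=p0)
mu_fit, sigma_fit, amp_fit = popt

peak_mag = BASELINE - amp_fit   # magnitude at the extremum, reached at t = mu_fit
print(f"peak on day {mu_fit:.1f} at magnitude {peak_mag:.2f}")
```

Because only three parameters are free, the extremum of the fitted curve is simply at `mu_fit`, with magnitude `BASELINE - amp_fit`.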

In [None]:
mags = []
with open("Dataset_Q2.csv", "r", newline="") as f:
    r = csv.reader(f)
    for row in r:
        if len(row) < 3 or row[2] == 'Magnitude':    #excluding blank rows and the heading row
            continue
        mags.append(float(row[2][0:4]))   #original data is in string, first four characters store magnitude, like 8.86

t = np.arange(1, len(mags) + 1)   #time in days, one observation per day

def g(x, s, m, a):
    #inverted Gaussian with the baseline magnitude fixed at 9
    #s: standard deviation, m: mean (day of peak brightness), a: depth of the dip
    return 9 - a * np.exp(-((x - m) / s)**2 / 2)

params, covar = curve_fit(g, t, mags)

sd, mean, amp = params

gy = g(t, sd, mean, amp)

errs = np.sqrt(np.diag(covar))    #one-sigma uncertainties on s, m and a

print("Day when brightest: ", date(year=2021, month=5, day=12) + timedelta(days=int(np.round(mean))))
print("Expected magnitude when brightest: ", 9 - amp)
print("Standard deviation in date (days): ", errs[1])

plt.plot(t, mags, 'o-', label='data')
plt.plot(t, gy, '-', label='fit')
plt.xlabel("Days since first observed")
plt.ylabel("Apparent Magnitude")
plt.legend()
plt.show()


#### Bonus part:

Can this date be trusted? We wouldn't want to claim that the object is at its brightest on a particular day and then have it brighten up even more later. Try to ascertain the error in this predicted date. Read the documentation of `curve_fit` and try to understand the statistical significance of the quantities it returns.
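As a starting point, the second value returned by `curve_fit` is the estimated covariance matrix of the fitted parameters, and the square roots of its diagonal are their one-sigma uncertainties. The sketch below illustrates this on synthetic data. The model name, the "true" parameters, and the per-point error `noise` are all invented for the example. Passing `sigma` together with `absolute_sigma=True` tells `curve_fit` to treat those errors as absolute; by default they only act as relative weights and the covariance is rescaled by the scatter of the residuals.

```python
import numpy as np
from scipy.optimize import curve_fit

def light_curve(t, mu, sigma, amp):
    """Inverted Gaussian below a fixed baseline magnitude of 9.0."""
    return 9.0 - amp * np.exp(-0.5 * ((t - mu) / sigma) ** 2)

# Synthetic stand-in data (NOT the assignment data)
rng = np.random.default_rng(1)
t = np.arange(1, 31, dtype=float)
noise = 0.005                                   # assumed error on each magnitude
mags = light_curve(t, 33.0, 7.0, 3.0) + rng.normal(0, noise, t.size)

popt, pcov = curve_fit(light_curve, t, mags, p0=[31.0, 6.0, 2.0],
                       sigma=np.full(t.size, noise), absolute_sigma=True)

perr = np.sqrt(np.diag(pcov))        # one-sigma uncertainty of mu, sigma, amp
corr = pcov / np.outer(perr, perr)   # correlation matrix of the parameters

print(f"peak day = {popt[0]:.2f} +/- {perr[0]:.2f}")
print(f"correlation between peak day and depth = {corr[0, 2]:.2f}")
```

An uncertainty well below one day would support quoting a single date, while a larger one suggests quoting a range of dates instead. The off-diagonal entries show how strongly the parameters trade off against each other when the peak lies outside the observed range.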

In [None]:
# Included in the previous cell
