# Data Science

A Python `list` is a simple and effective tool for processing small datasets. For large datasets, [NumPy](https://numpy.org/) performs much better than the Python list data type. However, real-world data science requires more features, such as friendly index names, more data types, and more functions. Built on NumPy, [Pandas](https://pandas.pydata.org/) is a fast, powerful, and easy-to-use data analysis tool. This section builds a data analysis project using Pandas.


## 1 The Task

Suppose that you are a teacher. At the end of a semester, you want to analyze student scores and grade the class. You want to see descriptive statistics for the scores. Additionally, you want to assign grades based on the following rules:

- A student gets an `A` if their average score is higher than or equal to `90`.
- A student gets a `B` if their average score is higher than or equal to `80`.
- A student gets a `C` if their average score is higher than or equal to `70`.
- Otherwise, a student gets a `D`.

## 2 Load Data

Large datasets are usually stored in files or in databases. For simplicity, this project loads data from a CSV file. A [comma-separated values](https://en.wikipedia.org/wiki/Comma-separated_values) file is a text file that uses commas to separate values. You can export Excel data or database data to a CSV file. The data file used here is a CSV file named `scores.csv` in the current folder. It was exported from an Excel file named `scores.xlsx`, which is also included in the current folder.

In [7]:
import pandas as pd

scores = pd.read_csv('scores.csv', header=0, index_col=0)
print(scores)
print(scores.dtypes)

       Wally  Eva  Sam  Katie  Bob
Test1     87  100   94    100   83
Test2     96   87   77     81   65
Test3     70   90   90     82   85
Wally    int64
Eva      int64
Sam      int64
Katie    int64
Bob      int64
dtype: object


In the code above, `header=0` tells Pandas to take the column names from the first row, and `index_col=0` tells it to take the index (row) names from the first column. You can use the VSCode Data Viewer to check that the data was read correctly. Compared with the built-in `list` data type and NumPy's `ndarray`, having column names and index names makes data analysis much easier.
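To see what the two parameters do without depending on the file, here is a minimal sketch that reads the same values from an in-memory string. The exact layout of `scores.csv` is an assumption (an empty first header cell followed by the student names), and `csv_text`, `labeled`, and `unlabeled` are illustrative names only.

```python
import io

import pandas as pd

# Assumed layout of scores.csv: an empty first header cell, then student names.
csv_text = """,Wally,Eva,Sam,Katie,Bob
Test1,87,100,94,100,83
Test2,96,87,77,81,65
Test3,70,90,90,82,85
"""

# header=0: the first line supplies the column names.
# index_col=0: the first column supplies the row labels.
labeled = pd.read_csv(io.StringIO(csv_text), header=0, index_col=0)

# Without index_col, the test names stay an ordinary data column
# and the rows are labeled 0, 1, 2.
unlabeled = pd.read_csv(io.StringIO(csv_text), header=0)

# Label-based lookup: Eva's score on Test2.
eva_test2 = labeled.loc["Test2", "Eva"]
print(eva_test2)

print(list(labeled.index))
print(list(unlabeled.index))
```

With the labels in place, a score can be looked up by test name and student name instead of by numeric position.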

## 3 Data Description

Pandas uses `Series` for one-dimensional arrays and `DataFrame` for two-dimensional arrays. Both have a `describe` method that computes basic descriptive statistics.

In [8]:
# to get descriptive statistics for each student
scores.describe()

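|       | Wally     | Eva       | Sam      | Katie     | Bob       |
|-------|-----------|-----------|----------|-----------|-----------|
| count | 3.0       | 3.0       | 3.0      | 3.0       | 3.0       |
| mean  | 84.333333 | 92.333333 | 87.0     | 87.666667 | 77.666667 |
| std   | 13.203535 | 6.806859  | 8.888194 | 10.692677 | 11.015141 |
| min   | 70.0      | 87.0      | 77.0     | 81.0      | 65.0      |
| 25%   | 78.5      | 88.5      | 83.5     | 81.5      | 74.0      |
| 50%   | 87.0      | 90.0      | 90.0     | 82.0      | 83.0      |
| 75%   | 91.5      | 95.0      | 92.0     | 91.0      | 84.0      |
| max   | 96.0      | 100.0     | 94.0     | 100.0     | 85.0      |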


In [9]:
# to get descriptive statistics for each test
scores.T.describe()

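|       | Test1    | Test2    | Test3    |
|-------|----------|----------|----------|
| count | 5.0      | 5.0      | 5.0      |
| mean  | 92.8     | 81.2     | 83.4     |
| std   | 7.661593 | 11.54123 | 8.234076 |
| min   | 83.0     | 65.0     | 70.0     |
| 25%   | 87.0     | 77.0     | 82.0     |
| 50%   | 94.0     | 81.0     | 85.0     |
| 75%   | 100.0    | 87.0     | 90.0     |
| max   | 100.0    | 96.0     | 90.0     |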
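The relationship between `Series` and `DataFrame`, and the two directions of aggregation, can be sketched as follows. The scores are rebuilt from the printed data so the block runs on its own; using `axis=1` instead of transposing is shown as one possible alternative, not as the method this project relies on, and the variable names are illustrative.

```python
import pandas as pd

# Rebuild the scores table from the values printed earlier.
scores = pd.DataFrame(
    {
        "Wally": [87, 96, 70],
        "Eva": [100, 87, 90],
        "Sam": [94, 77, 90],
        "Katie": [100, 81, 82],
        "Bob": [83, 65, 85],
    },
    index=["Test1", "Test2", "Test3"],
)

# Selecting one column of a DataFrame returns a Series,
# which has its own describe() method.
eva = scores["Eva"]
print(type(eva).__name__)
print(eva.describe())

# axis=1 aggregates across the columns, giving one value per test,
# so no transpose is needed for a single statistic.
per_test_mean = scores.mean(axis=1)
print(per_test_mean)

# The transposed table gives the same per-test means.
transposed_mean = scores.T.mean()
```

For a full statistical summary per test, `scores.T.describe()` remains the shorter option, since `describe` always summarizes column by column.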


## 4 Assigning Grades

What we want is a new `Series` that assigns a grade of `A`, `B`, `C`, or `D` based on each student's mean score. First, you can get everyone's mean score using the `mean()` method.

In [12]:
means = scores.mean()
print(means)

Wally    84.333333
Eva      92.333333
Sam      87.000000
Katie    87.666667
Bob      77.666667
dtype: float64


Then we define a function that converts a score to the corresponding letter grade.

In [13]:
def assign_grade(score):
    if score >= 90:
        letter = 'A'
    elif score >= 80:
        letter = 'B'
    elif score >= 70:
        letter = 'C'
    else:
        letter = 'D'
    
    return letter

Then you transform `means` into letter grades. Here `transform` calls `assign_grade` on each mean score.

In [16]:
grades = means.transform(assign_grade)
print(grades)

Wally    B
Eva      A
Sam      B
Katie    B
Bob      C
dtype: object
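As an alternative to a hand-written function, the same grading rules can be expressed as bins. The sketch below uses `pd.cut`, which is a different technique from the `transform` call above; the mean scores are copied from the printed output, and `bins`, `labels`, and `binned` are illustrative names.

```python
import numpy as np
import pandas as pd

# Mean scores copied from the output shown earlier.
means = pd.Series(
    {
        "Wally": 84.333333,
        "Eva": 92.333333,
        "Sam": 87.0,
        "Katie": 87.666667,
        "Bob": 77.666667,
    }
)

# Bin edges follow the grading rules. right=False makes each bin include
# its lower edge, so exactly 90 is an A and exactly 80 is a B.
bins = [-np.inf, 70, 80, 90, np.inf]
labels = ["D", "C", "B", "A"]

# pd.cut returns categorical data; astype(str) turns it into plain strings.
binned = pd.cut(means, bins=bins, labels=labels, right=False).astype(str)
print(binned)

# Check the boundary values explicitly.
edges = pd.cut(pd.Series([90, 80, 70, 69.9]), bins=bins, labels=labels, right=False)
print(edges.astype(str).tolist())
```

The binned version keeps the thresholds and the letters in two short lists, which may be easier to change than a chain of `if`/`elif` branches.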


You can also display the grades sorted by student name.

In [17]:
grades.sort_index()

Bob      C
Eva      A
Katie    B
Sam      B
Wally    B
dtype: object
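Sorting by score instead of by name works in a similar way. The following sketch combines the means and the grades into one table and ranks the students with `sort_values`; the values are copied from the outputs above, and `report` and `ranked` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Values copied from the outputs shown earlier.
means = pd.Series(
    {
        "Wally": 84.333333,
        "Eva": 92.333333,
        "Sam": 87.0,
        "Katie": 87.666667,
        "Bob": 77.666667,
    }
)
grades = pd.Series({"Wally": "B", "Eva": "A", "Sam": "B", "Katie": "B", "Bob": "C"})

# Both Series share the same index, so they line up row by row.
report = pd.DataFrame({"mean": means.round(1), "grade": grades})

# Rank the students from the highest mean score to the lowest.
ranked = report.sort_values("mean", ascending=False)
print(ranked)
```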