## Pandas
- A library for working with tabular data (rows $\times$ columns)

In [1]:
import pandas as pd
import numpy as np

Creating a dataframe manually:

In [3]:
user_data = {
    "MarksA": np.random.randint(50, 100, 10), 
    "MarksB": np.random.randint(50, 100, 10),
    "MarksC": np.random.randint(50, 100, 10),
}

user_data

{'MarksA': array([57, 52, 70, 89, 62, 61, 90, 58, 68, 85]),
 'MarksB': array([67, 68, 85, 64, 60, 96, 83, 91, 58, 75]),
 'MarksC': array([88, 94, 77, 79, 81, 88, 90, 53, 67, 99])}

To turn this dictionary into a `DataFrame`:
- Pass it to the `pd.DataFrame` constructor; each key becomes a column name

In [4]:
df = pd.DataFrame(user_data)
df

,MarksA,MarksB,MarksC
0,57,67,88
1,52,68,94
2,70,85,77
3,89,64,79
4,62,60,81
5,61,96,88
6,90,83,90
7,58,91,53
8,68,58,67
9,85,75,99


In [5]:
df.head()  # Returns the first 5 rows by default; pass n to change the count

,MarksA,MarksB,MarksC
0,57,67,88
1,52,68,94
2,70,85,77
3,89,64,79
4,62,60,81
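
As a small sketch of changing the row count, the cell below rebuilds the DataFrame from the values printed above (so it does not depend on the random draw) and asks for a different number of rows. The names `top3` and `last2` are only illustrative.

```python
import pandas as pd

# Same values as the DataFrame shown above, typed in by hand
df = pd.DataFrame({
    "MarksA": [57, 52, 70, 89, 62, 61, 90, 58, 68, 85],
    "MarksB": [67, 68, 85, 64, 60, 96, 83, 91, 58, 75],
    "MarksC": [88, 94, 77, 79, 81, 88, 90, 53, 67, 99],
})

top3 = df.head(3)   # first 3 rows instead of the default 5
last2 = df.tail(2)  # tail() is the counterpart for the last rows

print(top3)
print(last2)
```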


In [6]:
df.columns  # Shows the column labels

Index(['MarksA', 'MarksB', 'MarksC'], dtype='object')

#### `to_csv`
- Writes the DataFrame to a CSV file
- Pass `index=False` to leave the row index out of the file

In [7]:
df.to_csv("marks.csv", index=False)

#### `read_csv`
- Reads the data from a CSV file and returns it as a **DataFrame**

In [8]:
user_df = pd.read_csv("marks.csv")
user_df

,MarksA,MarksB,MarksC
0,57,67,88
1,52,68,94
2,70,85,77
3,89,64,79
4,62,60,81
5,61,96,88
6,90,83,90
7,58,91,53
8,68,58,67
9,85,75,99


The loaded DataFrame matches the one we saved.
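
To see why `index=False` matters, here is a minimal sketch that round-trips a tiny two-row DataFrame both ways. It writes to in-memory strings with `io.StringIO` instead of real files, which is an assumption made only to keep the example self-contained.

```python
import io
import pandas as pd

df = pd.DataFrame({"MarksA": [57, 52], "MarksB": [67, 68]})

# With no path given, to_csv returns the CSV text as a string
with_index = df.to_csv()               # row index is written as an unnamed first column
without_index = df.to_csv(index=False)

df_with = pd.read_csv(io.StringIO(with_index))
df_without = pd.read_csv(io.StringIO(without_index))

print(list(df_with.columns))     # ['Unnamed: 0', 'MarksA', 'MarksB']
print(list(df_without.columns))  # ['MarksA', 'MarksB']
```

When the index is saved, `read_csv` treats it as an ordinary column and names it `Unnamed: 0`, so the loaded DataFrame gains an extra column.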

#### `describe()`
- Summarises each numeric column of the dataset
- Gives some basic statistics:
  - count
  - mean
  - std
  - min and max
  - quartiles (25%, 50%, 75%)

In [10]:
user_df.describe()

,MarksA,MarksB,MarksC
count,10.0,10.0,10.0
mean,69.2,74.7,81.6
std,14.006348,13.367041,13.615351
min,52.0,58.0,53.0
25%,58.75,64.75,77.5
50%,65.0,71.5,84.5
75%,81.25,84.5,89.5
max,90.0,96.0,99.0
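
As a quick check on what these numbers mean, the sketch below recomputes a few of them for `MarksA` with the corresponding Series methods. The values are copied from the table above; note that `std` is the sample standard deviation (`ddof=1`).

```python
import pandas as pd

marks_a = pd.Series([57, 52, 70, 89, 62, 61, 90, 58, 68, 85], name="MarksA")
summary = marks_a.describe()

# Each entry of describe() matches a direct computation
print(summary["mean"], marks_a.mean())
print(summary["std"], marks_a.std())          # sample standard deviation (ddof=1)
print(summary["25%"], marks_a.quantile(0.25))
print(summary["50%"], marks_a.median())
```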


#### `iloc`
- Stands for **Integer Location**
- Selects rows and columns by their integer position, not by label
- `iloc[rows, columns]`: each selector can be an integer, a list of integers, or a slice

In [11]:
user_df.iloc[:3, [1,2]]

,MarksB,MarksC
0,67,88
1,68,94
2,85,77


Here we get the first 3 rows and the columns at positions 1 and 2, i.e. the 2nd and 3rd columns, since positions are 0-indexed.
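
The same selection can be written a few ways. The sketch below rebuilds the DataFrame from the values shown earlier and compares a list of positions, a positional slice, and the label-based `loc`, which is not covered in this section and is included only for contrast.

```python
import pandas as pd

user_df = pd.DataFrame({
    "MarksA": [57, 52, 70, 89, 62, 61, 90, 58, 68, 85],
    "MarksB": [67, 68, 85, 64, 60, 96, 83, 91, 58, 75],
    "MarksC": [88, 94, 77, 79, 81, 88, 90, 53, 67, 99],
})

by_list = user_df.iloc[:3, [1, 2]]   # explicit list of column positions
by_slice = user_df.iloc[:3, 1:3]     # positional slice, end is exclusive
by_label = user_df.loc[:2, ["MarksB", "MarksC"]]  # label slice, end is inclusive

print(by_list.equals(by_slice))  # True
print(by_list.equals(by_label))  # True
```

Note the difference in slice ends: `iloc[:3]` stops before position 3, while `loc[:2]` includes the row labelled 2.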

#### `columns.get_loc()`
- Looks up a column by name and returns its integer position
- Useful when `iloc` needs a position but we only know the column's name

In [12]:
idx = user_df.columns.get_loc("MarksB")
idx

1

In [13]:
user_df.iloc[:3, idx]

0    67
1    68
2    85
Name: MarksB, dtype: int64
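
`get_loc` handles one name at a time. For several names, one option is `Index.get_indexer`, sketched below on the same values as above; the variable names are illustrative.

```python
import pandas as pd

user_df = pd.DataFrame({
    "MarksA": [57, 52, 70, 89, 62, 61, 90, 58, 68, 85],
    "MarksB": [67, 68, 85, 64, 60, 96, 83, 91, 58, 75],
    "MarksC": [88, 94, 77, 79, 81, 88, 90, 53, 67, 99],
})

idx = user_df.columns.get_loc("MarksB")                         # single name -> single position
positions = user_df.columns.get_indexer(["MarksA", "MarksC"])   # several names -> array of positions

subset = user_df.iloc[:3, positions]
print(idx)
print(positions)
print(subset)
```

`get_indexer` returns -1 for a name that is not present, so it is worth checking the result before passing it to `iloc`.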

### Sorting dataset
- `sort_values()`
  - Sorts rows by the values in one or more columns
  - Given a list of columns, it sorts by the first one; ties are broken by the second, and so on

In [14]:
user_df.sort_values(by=["MarksC", "MarksA"], ascending=False)

,MarksA,MarksB,MarksC
9,85,75,99
1,52,68,94
6,90,83,90
5,61,96,88
0,57,67,88
4,62,60,81
3,89,64,79
2,70,85,77
8,68,58,67
7,58,91,53
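
`ascending` also accepts one flag per column. The sketch below, on the same values, sorts `MarksC` from high to low but breaks ties with `MarksA` from low to high, so the two rows with `MarksC` equal to 88 swap places compared with the output above.

```python
import pandas as pd

user_df = pd.DataFrame({
    "MarksA": [57, 52, 70, 89, 62, 61, 90, 58, 68, 85],
    "MarksB": [67, 68, 85, 64, 60, 96, 83, 91, 58, 75],
    "MarksC": [88, 94, 77, 79, 81, 88, 90, 53, 67, 99],
})

# One ascending flag per column in `by`
ranked = user_df.sort_values(by=["MarksC", "MarksA"], ascending=[False, True])
print(ranked)

# sort_values returns a new DataFrame; the original order is unchanged
print(list(user_df.index))

# Optional: renumber the rows after sorting
ranked_renumbered = ranked.reset_index(drop=True)
```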


#### `values`
- Returns the values in the DataFrame as a NumPy array (row and column labels are dropped)

In [15]:
user_values = user_df.values
print(user_values)

[[57 67 88]
 [52 68 94]
 [70 85 77]
 [89 64 79]
 [62 60 81]
 [61 96 88]
 [90 83 90]
 [58 91 53]
 [68 58 67]
 [85 75 99]]


In [16]:
type(user_values)

numpy.ndarray
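
The pandas documentation recommends the `to_numpy()` method over the `values` attribute for new code. A minimal sketch on a three-row subset of the data shows that both give the same array here, and that mixed integer and float columns are upcast to a common dtype; the `Bonus` column is hypothetical.

```python
import numpy as np
import pandas as pd

user_df = pd.DataFrame({
    "MarksA": [57, 52, 70],
    "MarksB": [67, 68, 85],
    "MarksC": [88, 94, 77],
})

from_values = user_df.values
from_method = user_df.to_numpy()
print(np.array_equal(from_values, from_method))  # True
print(from_method.shape)                         # (3, 3)

# Hypothetical float column: the combined array becomes float64
mixed = user_df.assign(Bonus=[0.5, 1.0, 1.5])
print(mixed.to_numpy().dtype)
```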

Making a DataFrame from a NumPy array:

In [17]:
new_df = pd.DataFrame(user_values, dtype="int32", columns=["Physics", "Chemistry", "Maths"])
new_df

,Physics,Chemistry,Maths
0,57,67,88
1,52,68,94
2,70,85,77
3,89,64,79
4,62,60,81
5,61,96,88
6,90,83,90
7,58,91,53
8,68,58,67
9,85,75,99


In [18]:
new_df.to_csv("PCM.csv", index=False)

We have now created a new DataFrame from a NumPy array and saved it to a CSV file.
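
One thing to keep in mind is that a CSV file stores only text, so the `int32` dtype chosen above is not preserved when the file is read back. The sketch below illustrates this on a three-row subset, using an in-memory buffer in place of `PCM.csv` to stay self-contained.

```python
import io
import numpy as np
import pandas as pd

values = np.array([[57, 67, 88], [52, 68, 94], [70, 85, 77]])
new_df = pd.DataFrame(values, dtype="int32", columns=["Physics", "Chemistry", "Maths"])

csv_text = new_df.to_csv(index=False)

# Default: read_csv infers the dtype again from the text
reloaded = pd.read_csv(io.StringIO(csv_text))
print(reloaded["Physics"].dtype)   # int64

# Passing dtype restores the original choice
reloaded32 = pd.read_csv(io.StringIO(csv_text), dtype="int32")
print(reloaded32["Physics"].dtype)  # int32
```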