## Introduction to <span style="color: green;">Pandas</span>

**Pandas** is a powerful Python library for data manipulation and analysis. It provides high-performance data structures and data analysis tools that make working with structured data easy and efficient.

**Pandas provides two main classes for handling data:**

* **Series:**
    * A one-dimensional labeled array capable of holding any data type (integers, floats, strings, objects, etc.).
    * Each element in a Series is associated with a label, which can be any hashable object (e.g., strings, integers, or tuples).
    * Series objects are often created from Python lists or NumPy arrays.
* **DataFrame:**
    * A two-dimensional labeled data structure with rows and columns.
    * Each column in a DataFrame is a Series object, and each row can also be retrieved as a Series labeled by the column names.
    * DataFrames can be created from various sources, such as Python dictionaries, lists of lists, or NumPy arrays.
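
The two structures described above can be sketched as follows. This is a minimal illustration; the variable names `ages` and `students` and the sample values are assumptions, not part of the exercises below.

```python
import pandas as pd

# A Series: one-dimensional, each value paired with a label (the index).
ages = pd.Series([15, 11, 20], index=["ann", "bob", "cy"], name="age")
print(ages["bob"])            # look up a value by its label

# A DataFrame: two-dimensional, built here from a dict of columns.
students = pd.DataFrame({"student_id": [1, 2, 3], "age": [15, 11, 20]})

# Selecting a single column returns a Series.
print(type(students["age"]))

# Selecting a single row by position also returns a Series,
# labeled by the column names.
print(students.iloc[0])
```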

**Installation Process:**

* **Using pip:** `pip install pandas`
* **Using conda:** `conda install pandas`

* For more help with installation: https://pandas.pydata.org/docs/getting_started/install.html

<span style="color: orange;">Let's begin with practice</span>
* Leetcode Practice Link: https://leetcode.com/studyplan/introduction-to-pandas/

***

### Q1 Create a DataFrame from List
### Difficulty: <span style="color: red;">Easy</span>
Write a solution to create a DataFrame from a 2D list called <span style="color: green;">student_data</span>. This 2D list contains the IDs and ages of some students.
The DataFrame should have two columns, <span style="color: green;">student_id</span> and <span style="color: green;">age</span>, and be in the same order as the original 2D list.
The result format is in the following example.

**Example Input:**

```python
student_data = [
  [1, 15],
  [2, 11],
  [3, 11],
  [4, 20]
]
```

**Output:**

| student_id | age |
|---|---|
| 1 | 15 |
| 2 | 11 |
| 3 | 11 |
| 4 | 20 |


Explanation: A DataFrame was created on top of student_data, with two columns named <span style="color: green;">student_id</span> and <span style="color: green;">age</span>.

In [2]:
#Solution
import pandas as pd
student_data=[[1, 15], [2, 11], [3, 11], [4, 20]]
df=pd.DataFrame(student_data, columns=['student_id', 'age'])
print(df.head())

   student_id  age
0           1   15
1           2   11
2           3   11
3           4   20
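
On a judge such as LeetCode, the same logic is usually submitted as a function rather than a script. A minimal sketch, assuming a function name `create_dataframe` (hypothetical; the name the judge expects may differ):

```python
import pandas as pd

def create_dataframe(student_data):
    """Build a DataFrame with student_id and age columns from a 2D list."""
    return pd.DataFrame(student_data, columns=["student_id", "age"])

df = create_dataframe([[1, 15], [2, 11], [3, 11], [4, 20]])
print(df)
```

Row order follows the order of the inner lists, so no sorting step is needed.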


***

### Q2 Get the Size of a DataFrame                                                                               
### Difficulty: <span style="color: red;">Easy</span>

* DataFrame <span style="color: green;">players</span>

| Column Name | Type |
|---|---|
| player_id | int |
| name | object |
| age | int |
| position | object |
| ... | ... |




Write a solution to calculate and display the number of rows and columns of <span style="color: green;">players</span>.<br>
Return the result as an array:<br>
<span style="color: green;">[number of rows, number of columns]</span>

**Example Input:**

| player_id | name     | age | position    | team               |
|-----------|----------|-----|-------------|--------------------|
| 846       | Mason    | 21  | Forward     | RealMadrid         |
| 749       | Riley    | 30  | Winger      | Barcelona          |
| 155       | Bob      | 28  | Striker     | ManchesterUnited   |
| 583       | Isabella | 32  | Goalkeeper  | Liverpool          |
| 388       | Zachary  | 24  | Midfielder  | BayernMunich       |
| 883       | Ava      | 23  | Defender    | Chelsea            |
| 355       | Violet   | 18  | Striker     | Juventus           |
| 247       | Thomas   | 27  | Striker     | ParisSaint-Germain |
| 761       | Jack     | 33  | Midfielder  | ManchesterCity     |
| 642       | Charlie  | 36  | Center-back | Arsenal            |

**Output:**
[10, 5]

**Explanation:**
This DataFrame contains 10 rows and 5 columns.

In [10]:
#Solution
import pandas as pd
players=[
    ["player_id", "name", "age", "position", "team"],
    [846, "Mason", 21, "Forward", "RealMadrid"],
    [749, "Riley", 30, "Winger", "Barcelona"],
    [155, "Bob", 28, "Striker", "ManchesterUnited"],
    [583, "Isabella", 32, "Goalkeeper", "Liverpool"],
    [388, "Zachary", 24, "Midfielder", "BayernMunich"],
    [883, "Ava", 23, "Defender", "Chelsea"],
    [355, "Violet", 18, "Striker", "Juventus"],
    [247, "Thomas", 27, "Striker", "ParisSaint-Germain"],
    [761, "Jack", 33, "Midfielder", "ManchesterCity"],
    [642, "Charlie", 36, "Center-back", "Arsenal"]
]

#The first inner list holds the column headers, so we pass players[0] as columns
#and slice from index 1 onward to get the actual data rows
df=pd.DataFrame(players[1:], columns=players[0])
print(df.head())

   player_id      name  age    position              team
0        846     Mason   21     Forward        RealMadrid
1        749     Riley   30      Winger         Barcelona
2        155       Bob   28     Striker  ManchesterUnited
3        583  Isabella   32  Goalkeeper         Liverpool
4        388   Zachary   24  Midfielder      BayernMunich


In [12]:
#df.shape returns a tuple (rows, columns); the output must be a list, so we convert it with list()
print(list(df.shape))

[10, 5]
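
Because `df.shape` is a `(rows, columns)` tuple, the two counts can also be read individually. A small sketch on a three-row frame made up for illustration (not the full `players` table):

```python
import pandas as pd

players = pd.DataFrame({
    "player_id": [846, 749, 155],
    "name": ["Mason", "Riley", "Bob"],
    "age": [21, 30, 28],
})

rows, cols = players.shape                     # unpack the (rows, columns) tuple
print([rows, cols])                            # [3, 3]
print([len(players), len(players.columns)])    # same counts without shape
```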


***

### Q3 Display the First Three Rows
### Difficulty: <span style="color: red;">Easy</span>
Write a solution to display the first <span style="color: green;">3</span> rows of this DataFrame.

**Example Input:**
* DataFrame: <span style="color: green;">employees</span>

| employee_id | name      | department            | salary |
| ----------- | --------- | --------------------- | ------ |
| 3           | Bob       | Operations            | 48675  |
| 90          | Alice     | Sales                 | 11096  |
| 9           | Tatiana   | Engineering           | 33805  |
| 60          | Annabelle | InformationTechnology | 37678  |
| 49          | Jonathan  | HumanResources        | 23793  |
| 43          | Khaled    | Administration        | 40454  |

**Output:**
| employee_id | name      | department            | salary |
| ----------- | --------- | --------------------- | ------ |
| 3           | Bob       | Operations            | 48675  |
| 90          | Alice     | Sales                 | 11096  |
| 9           | Tatiana   | Engineering           | 33805  |

**Explanation:**
Only the first 3 rows are displayed.

In [7]:
#Solution
import pandas as pd
employees=[
    ["employee_id", "name", "department", "salary"],
    [3, "Bob", "Operations", 48675],
    [90, "Alice", "Sales", 11096],
    [9, "Tatiana", "Engineering", 33805],
    [60, "Annabelle", "InformationTechnology", 37678],
    [49, "Jonathan", "HumanResources", 23793],
    [43, "Khaled", "Administration", 40454]
]
df=pd.DataFrame(employees[1:], columns=employees[0])
print(df.head())

   employee_id       name             department  salary
0            3        Bob             Operations   48675
1           90      Alice                  Sales   11096
2            9    Tatiana            Engineering   33805
3           60  Annabelle  InformationTechnology   37678
4           49   Jonathan         HumanResources   23793


In [5]:
#To get only the first n rows, pass n to head() (in our case 3)
df.head(3)

   employee_id     name   department  salary
0            3      Bob   Operations   48675
1           90    Alice        Sales   11096
2            9  Tatiana  Engineering   33805
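
Without an argument, `head()` returns the first 5 rows, which is why the earlier preview stopped at index 4. The sketch below, on a trimmed two-column copy of the table (an assumption for brevity), shows the default and the equivalent positional slice:

```python
import pandas as pd

employees = pd.DataFrame({
    "employee_id": [3, 90, 9, 60, 49, 43],
    "name": ["Bob", "Alice", "Tatiana", "Annabelle", "Jonathan", "Khaled"],
})

print(len(employees.head()))     # 5: the default when no argument is given
first_three = employees.head(3)
print(first_three)

# Positional slicing with iloc selects the same rows.
print(first_three.equals(employees.iloc[:3]))  # True
```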


***

### Q4 Select Data
### Difficulty: <span style="color: red;">Easy</span>
Write a solution to select the name and age of the student with <span style="color: green;">student_id = 101</span>.

* DataFrame <span style="color: green;">students</span>

| Column Name | Type |
|---|---|
| student_id | int |
| name | object |
| age | int |

**Example Input:**

| student_id | name    | age |
| ---------- | ------- | --- |
| 101        | Ulysses | 13  |
| 53         | William | 10  |
| 128        | Henry   | 6   |
| 3          | Henry   | 11  |

**Output:**

| name    | age |
| ------- | --- |
| Ulysses | 13  |

**Explanation:**
Student Ulysses has student_id = 101, so we select his name and age.

In [9]:
#Solution
import pandas as pd
students=[
    ["student_id", "name", "age"],
    [101, "Ulysses", 13],
    [53, "William", 10],
    [128, "Henry", 6],
    [3, "Henry", 11]
]
df=pd.DataFrame(students[1:], columns=students[0])
print(df.head())

   student_id     name  age
0         101  Ulysses   13
1          53  William   10
2         128    Henry    6
3           3    Henry   11


In [43]:
# The boolean filter df["student_id"]==101 first keeps only the matching row.
# We then use .iloc, which slices a DataFrame by position: the first argument selects rows and the second selects columns.
# A colon (":") as the first argument selects every row. The columns we need, name and age, are at positions 1 and 2,
# so the second argument slices from position 1 to the end.
print(df[df["student_id"]==101].iloc[:,1:])

      name  age
0  Ulysses   13
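
The positional slice above depends on `name` and `age` being the last two columns. A label-based alternative with `.loc` names the columns explicitly, so it keeps working if the column order changes. A minimal sketch (the variable name `result` is illustrative):

```python
import pandas as pd

students = pd.DataFrame(
    [[101, "Ulysses", 13], [53, "William", 10], [128, "Henry", 6], [3, "Henry", 11]],
    columns=["student_id", "name", "age"],
)

# The boolean mask picks the rows; the list picks the columns by name.
result = students.loc[students["student_id"] == 101, ["name", "age"]]
print(result)
```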


***

### Q5 Create a New Column
### Difficulty: <span style="color: red;">Easy</span>
* DataFrame <span style="color: green;">employees</span>

| Column Name | Type |
|---|---|
| name | object |
| salary | int |

A company plans to provide its employees with a bonus.<br>
Write a solution to create a new column named <span style="color: green;">bonus</span> that contains the <strong>doubled values</strong> of the <span style="color: green;">salary</span> column.

**Example Input:**
| name    | salary |
| ------- | ------ |
| Piper   | 4548   |
| Grace   | 28150  |
| Georgia | 1103   |
| Willow  | 6593   |
| Finn    | 74576  |
| Thomas  | 24433  |

**Output:**
| name    | salary | bonus |
| ------- | ------ | ------ |
| Piper   | 4548   | 9096 |
| Grace   | 28150  | 56300 |
| Georgia | 1103   | 2206 |
| Willow  | 6593   | 13186 |
| Finn    | 74576  | 149152 |
| Thomas  | 24433  | 48866 |

**Explanation:**
A new column bonus is created by doubling the value in the column salary.

In [44]:
#Solution
import pandas as pd
employees=[
    ["name", "salary"],
    ["Piper", 4548],
    ["Grace", 28150],
    ["Georgia", 1103],
    ["Willow", 6593],
    ["Finn", 74576],
    ["Thomas", 24433]
]
df=pd.DataFrame(employees[1:], columns=employees[0])
print(df.head())

      name  salary
0    Piper    4548
1    Grace   28150
2  Georgia    1103
3   Willow    6593
4     Finn   74576


In [45]:
#Assigning to a column name that does not exist yet creates it; each bonus value is double the salary in the same row
df["bonus"]=df["salary"]*2
print(df.head())

      name  salary   bonus
0    Piper    4548    9096
1    Grace   28150   56300
2  Georgia    1103    2206
3   Willow    6593   13186
4     Finn   74576  149152
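
Assigning with `df["bonus"] = ...` modifies the DataFrame in place. When the original should stay unchanged, `DataFrame.assign` returns a new frame with the extra column. A short sketch on two sample rows (the name `with_bonus` is illustrative):

```python
import pandas as pd

employees = pd.DataFrame({"name": ["Piper", "Grace"], "salary": [4548, 28150]})

# assign() returns a new DataFrame and leaves the original untouched.
with_bonus = employees.assign(bonus=employees["salary"] * 2)
print(with_bonus)
print("bonus" in employees.columns)  # False
```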


***

### Q6 Drop Duplicate Rows
### Difficulty: <span style="color: red;">Easy</span>
* DataFrame <span style="color: green;">customers</span>

| Column Name | Type |
|---|---|
| customer_id | int |
| name | object |
| email | object |

There are some duplicate rows in the DataFrame based on the <span style="color: green;">email</span> column.<br>
Write a solution to remove these duplicate rows and keep only the <strong>first</strong> occurrence.<br>
The result format is in the following example.

**Example Input:**
| customer_id | name    | email               |
| ----------- | ------- | ------------------- |
| 1           | Ella    | emily@example.com   |
| 2           | David   | michael@example.com |
| 3           | Zachary | sarah@example.com   |
| 4           | Alice   | john@example.com    |
| 5           | Finn    | john@example.com    |
| 6           | Violet  | alice@example.com   |

**Output:**
| customer_id | name    | email               |
| ----------- | ------- | ------------------- |
| 1           | Ella    | emily@example.com   |
| 2           | David   | michael@example.com |
| 3           | Zachary | sarah@example.com   |
| 4           | Alice   | john@example.com    |
| 6           | Violet  | alice@example.com   |

**Explanation:**
Alice (customer_id = 4) and Finn (customer_id = 5) both use john@example.com, so only the first occurrence of this email is retained.

In [59]:
#Solution
import pandas as pd
customers=[
    ["customer_id", "name", "email"],
    [1, "Ella", "emily@example.com"],
    [2, "David", "michael@example.com"],
    [3, "Zachary", "sarah@example.com"],
    [4, "Alice", "john@example.com"],
    [5, "Finn", "john@example.com"],
    [6, "Violet", "alice@example.com"]
]
df=pd.DataFrame(customers[1:], columns=customers[0])
print(df.head())

   customer_id     name                email
0            1     Ella    emily@example.com
1            2    David  michael@example.com
2            3  Zachary    sarah@example.com
3            4    Alice     john@example.com
4            5     Finn     john@example.com


In [60]:
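#Rows are compared on the email column only; keep='first' retains the first row for each email
new_df=df.drop_duplicates(subset="email", keep='first')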
print(new_df)

   customer_id     name                email
0            1     Ella    emily@example.com
1            2    David  michael@example.com
2            3  Zachary    sarah@example.com
3            4    Alice     john@example.com
5            6   Violet    alice@example.com
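
Note that the index in the output jumps from 3 to 5: `drop_duplicates` keeps the original row labels. If a continuous index is wanted, `reset_index(drop=True)` renumbers the rows. A minimal sketch on three of the sample rows (the name `deduped` is illustrative):

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [4, 5, 6],
    "name": ["Alice", "Finn", "Violet"],
    "email": ["john@example.com", "john@example.com", "alice@example.com"],
})

deduped = customers.drop_duplicates(subset="email", keep="first")
print(deduped.index.tolist())    # [0, 2]: original labels are kept

# Optional: renumber the rows from 0 after dropping.
deduped = deduped.reset_index(drop=True)
print(deduped)
```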


***