# Pandas Basic Operations

In this notebook, we'll be working with a simple dataset on fictional book sales to practice some basic pandas operations. Let's get started!

## Step 1: Setup

*You may ignore the code in this cell. You must still execute it.*

The cell below creates the dataset.

In [1]:
# Sample data for demonstration.

data = """Title,Author,Genre,Price,Sold
The Great Tale,John Doe,Fiction,15,1500
Mystery of the Night,Jane Smith,Mystery,20,890
Learning Pandas,Alice Johnson,Education,25,340
Journey to the Stars,Bob Brown,Sci-fi,18,2200
History Repeats,Peter G.,History,22,500
"""

with open("book_sales.csv", "w") as file:
    file.write(data)

## Step 2: Reading the Dataset

We will start by loading our dataset. This dataset contains the sales data of fictional books.


In [2]:
import pandas

df = pandas.read_csv("book_sales.csv")

df.head()

|   | Title | Author | Genre | Price | Sold |
|---|---|---|---|---|---|
| 0 | The Great Tale | John Doe | Fiction | 15 | 1500 |
| 1 | Mystery of the Night | Jane Smith | Mystery | 20 | 890 |
| 2 | Learning Pandas | Alice Johnson | Education | 25 | 340 |
| 3 | Journey to the Stars | Bob Brown | Sci-fi | 18 | 2200 |
| 4 | History Repeats | Peter G. | History | 22 | 500 |


You can also read `.xlsx` files with `pandas.read_excel('excel_file.xlsx', sheet_name='sheet_name')`.

Typing `pandas.read_csv` in full quickly becomes tedious. A common convention is to import the library under a shorter alias with `import pandas as pd`. Try loading the dataset again using this alias.

In [3]:
import pandas as pd
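
One possible way to complete this exercise is sketched below. It re-creates `book_sales.csv` so that the snippet runs on its own; the name `df_alias` is chosen here for illustration only and is not used elsewhere in the notebook.

```python
import pandas as pd

# Re-create the file from Step 1 so this snippet runs on its own.
data = """Title,Author,Genre,Price,Sold
The Great Tale,John Doe,Fiction,15,1500
Mystery of the Night,Jane Smith,Mystery,20,890
Learning Pandas,Alice Johnson,Education,25,340
Journey to the Stars,Bob Brown,Sci-fi,18,2200
History Repeats,Peter G.,History,22,500
"""
with open("book_sales.csv", "w") as file:
    file.write(data)

# Same call as in Step 2, now through the shorter alias.
df_alias = pd.read_csv("book_sales.csv")
print(df_alias.shape)  # (5, 5): five books, five columns
```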



## Step 3: Describing the Data

Let's get a quick statistical summary of the numeric columns in our dataset using the `describe()` method.


In [4]:
df.describe()

|   | Price | Sold |
|---|---|---|
| count | 5.0 | 5.0 |
| mean | 20.0 | 1086.0 |
| std | 3.807887 | 766.602896 |
| min | 15.0 | 340.0 |
| 25% | 18.0 | 500.0 |
| 50% | 20.0 | 890.0 |
| 75% | 22.0 | 1500.0 |
| max | 25.0 | 2200.0 |
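
Each row of this summary can also be computed directly on a single column. The sketch below rebuilds the dataset from an in-memory string (an assumption made so the snippet is self-contained) and checks that the `mean` and `50%` rows agree with the `mean()` and `median()` methods. The name `summary` is illustrative.

```python
import io
import pandas as pd

csv_text = """Title,Author,Genre,Price,Sold
The Great Tale,John Doe,Fiction,15,1500
Mystery of the Night,Jane Smith,Mystery,20,890
Learning Pandas,Alice Johnson,Education,25,340
Journey to the Stars,Bob Brown,Sci-fi,18,2200
History Repeats,Peter G.,History,22,500
"""
df = pd.read_csv(io.StringIO(csv_text))

summary = df.describe()

# Individual statistics are addressed by row label and column name.
print(summary.loc["mean", "Sold"])   # 1086.0
print(df["Sold"].mean())             # 1086.0

# The "50%" row is the median of the column.
print(summary.loc["50%", "Price"])   # 20.0
print(df["Price"].median())          # 20.0
```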


## Step 4: Exploring Columns and Rows

We can easily see the columns in our dataframe and extract or remove them.


In [5]:
df.columns

Index(['Title', 'Author', 'Genre', 'Price', 'Sold'], dtype='object')

In [6]:
prices = df["Price"]
prices

0    15
1    20
2    25
3    18
4    22
Name: Price, dtype: int64

You can also select multiple columns from the dataframe. For example, the following command selects the columns `Title` and `Author`.

In [7]:
df[["Title", "Author"]]

|   | Title | Author |
|---|---|---|
| 0 | The Great Tale | John Doe |
| 1 | Mystery of the Night | Jane Smith |
| 2 | Learning Pandas | Alice Johnson |
| 3 | Journey to the Stars | Bob Brown |
| 4 | History Repeats | Peter G. |
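
The two selection styles return different types: a single column name gives a `Series`, while a list of names gives a `DataFrame`, even when the list holds only one name. A minimal sketch, with the dataset rebuilt in memory and the names `single` and `as_frame` chosen for illustration:

```python
import io
import pandas as pd

csv_text = """Title,Author,Genre,Price,Sold
The Great Tale,John Doe,Fiction,15,1500
Mystery of the Night,Jane Smith,Mystery,20,890
Learning Pandas,Alice Johnson,Education,25,340
Journey to the Stars,Bob Brown,Sci-fi,18,2200
History Repeats,Peter G.,History,22,500
"""
df = pd.read_csv(io.StringIO(csv_text))

single = df["Price"]       # one name -> Series
as_frame = df[["Price"]]   # list of names -> DataFrame with one column

print(type(single).__name__)    # Series
print(type(as_frame).__name__)  # DataFrame
print(as_frame.shape)           # (5, 1)
```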


In [8]:
df_without_sold = df.drop(columns=["Sold"])
df_without_sold

|   | Title | Author | Genre | Price |
|---|---|---|---|---|
| 0 | The Great Tale | John Doe | Fiction | 15 |
| 1 | Mystery of the Night | Jane Smith | Mystery | 20 |
| 2 | Learning Pandas | Alice Johnson | Education | 25 |
| 3 | Journey to the Stars | Bob Brown | Sci-fi | 18 |
| 4 | History Repeats | Peter G. | History | 22 |
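
Note that `drop` returns a new dataframe and leaves the original untouched, which is why the result is stored under a separate name. The sketch below, with the dataset rebuilt in memory for self-containment, checks this:

```python
import io
import pandas as pd

csv_text = """Title,Author,Genre,Price,Sold
The Great Tale,John Doe,Fiction,15,1500
Mystery of the Night,Jane Smith,Mystery,20,890
Learning Pandas,Alice Johnson,Education,25,340
Journey to the Stars,Bob Brown,Sci-fi,18,2200
History Repeats,Peter G.,History,22,500
"""
df = pd.read_csv(io.StringIO(csv_text))

df_without_sold = df.drop(columns=["Sold"])

# The original dataframe still has the column; only the copy lacks it.
print("Sold" in df.columns)               # True
print("Sold" in df_without_sold.columns)  # False
```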


We can also retrieve rows from a dataset. The following command selects the first 3 rows of the dataset.

In [9]:
df[0:3]

|   | Title | Author | Genre | Price | Sold |
|---|---|---|---|---|---|
| 0 | The Great Tale | John Doe | Fiction | 15 | 1500 |
| 1 | Mystery of the Night | Jane Smith | Mystery | 20 | 890 |
| 2 | Learning Pandas | Alice Johnson | Education | 25 | 340 |


Note that rows are numbered 0, 1, 2, 3, ... In general, for natural numbers `a` < `b`, the command `df[a:b]` selects rows `a`, `a+1`, `a+2`, `...`, `b-1` of the dataset; row `b` itself is excluded. How can you select rows 2 and 3 from your dataset?

In [12]:
df[2:4]

|   | Title | Author | Genre | Price | Sold |
|---|---|---|---|---|---|
| 2 | Learning Pandas | Alice Johnson | Education | 25 | 340 |
| 3 | Journey to the Stars | Bob Brown | Sci-fi | 18 | 2200 |
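
The same rows can also be selected with `iloc`, which makes it explicit that the selection is by position. This is offered as an alternative, not as the notebook's intended answer; the dataset is rebuilt in memory and the names `rows_slice` and `rows_iloc` are illustrative.

```python
import io
import pandas as pd

csv_text = """Title,Author,Genre,Price,Sold
The Great Tale,John Doe,Fiction,15,1500
Mystery of the Night,Jane Smith,Mystery,20,890
Learning Pandas,Alice Johnson,Education,25,340
Journey to the Stars,Bob Brown,Sci-fi,18,2200
History Repeats,Peter G.,History,22,500
"""
df = pd.read_csv(io.StringIO(csv_text))

rows_slice = df[2:4]      # plain slice: rows 2 and 3
rows_iloc = df.iloc[2:4]  # positional indexer: the same rows

print(rows_slice.equals(rows_iloc))  # True

# iloc also accepts a single position and returns that row as a Series.
print(df.iloc[3]["Title"])           # Journey to the Stars
```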


## Step 5: Sorting Rows in a DataFrame

We can sort the rows of our dataframe by the values in any column.

In [13]:
df_sorted_by_price = df.sort_values(by="Price")
df_sorted_by_price

|   | Title | Author | Genre | Price | Sold |
|---|---|---|---|---|---|
| 0 | The Great Tale | John Doe | Fiction | 15 | 1500 |
| 3 | Journey to the Stars | Bob Brown | Sci-fi | 18 | 2200 |
| 1 | Mystery of the Night | Jane Smith | Mystery | 20 | 890 |
| 4 | History Repeats | Peter G. | History | 22 | 500 |
| 2 | Learning Pandas | Alice Johnson | Education | 25 | 340 |


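## Step 6: Compute revenue

We can create a new column from existing ones. Here, the revenue of each book is its price multiplied by the number of copies sold.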

In [14]:
df["Revenue"] = df["Price"] * df["Sold"]
df

|   | Title | Author | Genre | Price | Sold | Revenue |
|---|---|---|---|---|---|---|
| 0 | The Great Tale | John Doe | Fiction | 15 | 1500 | 22500 |
| 1 | Mystery of the Night | Jane Smith | Mystery | 20 | 890 | 17800 |
| 2 | Learning Pandas | Alice Johnson | Education | 25 | 340 | 8500 |
| 3 | Journey to the Stars | Bob Brown | Sci-fi | 18 | 2200 | 39600 |
| 4 | History Repeats | Peter G. | History | 22 | 500 | 11000 |
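
Assigning to `df["Revenue"]` changes `df` in place. As a hedged alternative, `assign` returns a new dataframe with the extra column and leaves the original as it was. The sketch below rebuilds the dataset in memory and uses the illustrative name `with_revenue`; it also sums the new column to get the total revenue.

```python
import io
import pandas as pd

csv_text = """Title,Author,Genre,Price,Sold
The Great Tale,John Doe,Fiction,15,1500
Mystery of the Night,Jane Smith,Mystery,20,890
Learning Pandas,Alice Johnson,Education,25,340
Journey to the Stars,Bob Brown,Sci-fi,18,2200
History Repeats,Peter G.,History,22,500
"""
df = pd.read_csv(io.StringIO(csv_text))

# Element-wise product of two columns, stored in a new dataframe.
with_revenue = df.assign(Revenue=df["Price"] * df["Sold"])

print(with_revenue["Revenue"].tolist())  # [22500, 17800, 8500, 39600, 11000]
print(with_revenue["Revenue"].sum())     # 99400
print("Revenue" in df.columns)           # False: df itself is unchanged
```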


## Step 7: Top 3 bestsellers

Now you can apply what you have learned in this notebook. Find the top 3 bestsellers, ranked here by revenue.

In [23]:
df_best_seller = df.sort_values(ascending=False, by="Revenue")
df_best_seller[0:3]

|   | Title | Author | Genre | Price | Sold | Revenue |
|---|---|---|---|---|---|---|
| 3 | Journey to the Stars | Bob Brown | Sci-fi | 18 | 2200 | 39600 |
| 0 | The Great Tale | John Doe | Fiction | 15 | 1500 | 22500 |
| 1 | Mystery of the Night | Jane Smith | Mystery | 20 | 890 | 17800 |
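
Sorting and slicing is one way to solve this. As an alternative sketch, `nlargest` does both in a single call. The dataset is rebuilt in memory so the snippet is self-contained, and the names `top3` and `top3_sorted` are illustrative.

```python
import io
import pandas as pd

csv_text = """Title,Author,Genre,Price,Sold
The Great Tale,John Doe,Fiction,15,1500
Mystery of the Night,Jane Smith,Mystery,20,890
Learning Pandas,Alice Johnson,Education,25,340
Journey to the Stars,Bob Brown,Sci-fi,18,2200
History Repeats,Peter G.,History,22,500
"""
df = pd.read_csv(io.StringIO(csv_text))
df["Revenue"] = df["Price"] * df["Sold"]

# The three rows with the highest revenue, largest first.
top3 = df.nlargest(3, "Revenue")
print(top3["Title"].tolist())
# ['Journey to the Stars', 'The Great Tale', 'Mystery of the Night']

# Same result as sorting in descending order and keeping the first three rows.
top3_sorted = df.sort_values(by="Revenue", ascending=False)[0:3]
print(top3.equals(top3_sorted))  # True
```

On this dataset, ranking by `Sold` instead of `Revenue` gives the same three titles in the same order.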
