In [1]:
import pandas as pd
import numpy as np

## Questions:
1. **array**: Create a Pandas DataFrame with a single column containing numbers 1-100, and then convert the DataFrame to a NumPy array.
2. **concat**: Create two Pandas DataFrames with the same columns but different data. Use the concat function to combine the two DataFrames into one DataFrame.
3. **cut**: Load a dataset of customer heights in inches and use the cut function to bin the heights into different categories (e.g. "short", "average", "tall").
4. **date_range**: Use the date_range function to generate a range of dates starting from today and ending in 30 days, and create a Pandas DataFrame with a single column containing the dates.
5. **eval**: Load a dataset of stock prices and use the eval method to calculate the returns (percentage change) for each day.
6. **get_dummies**: Load a dataset of customer genders and use the get_dummies function to one-hot encode the genders.
7. **infer_freq**: Load a dataset of daily temperatures and use the infer_freq function to automatically infer the frequency of the data (e.g. daily, weekly, monthly).
8. **interval_range**: Use the interval_range function to generate a range of intervals with a specified start, end, and step, and create a Pandas DataFrame with a single column containing the intervals.
9. **isna**: Load a dataset with missing values and use the isna method to identify the missing values.
10. **isnull**: Load a dataset with missing values and use the isnull method to count the number of missing values.
11. **merge**: Load two datasets with different columns but related data and use the merge function to combine the datasets into one DataFrame.
12. **notna**: Load a dataset with missing values and use the notna method to identify the non-missing values.
13. **notnull**: Load a dataset with missing values and use the notnull method to count the number of non-missing values.
14. **pivot_table**: Create a program that takes a pandas DataFrame and generates a pivot table from the data, aggregating the values based on specified columns.
15. **plotting**: Create a program that takes a pandas DataFrame and creates a bar plot, line plot, and scatter plot of the data.
16. **qcut**: Create a program that takes a pandas Series and creates quantile bins from the values, creating a categorical variable from the numerical data.
17. **read_csv**: Create a program that reads a CSV file and displays the first 5 rows of the data.
18. **read_excel**: Create a program that reads an Excel file and displays the first 5 rows of the data.
19. **read_html**: Create a program that reads an HTML table from a website and displays the first 5 rows of the data.
20. **read_json**: Create a program that reads a JSON file and displays the first 5 entries of the data.
21. **read_pickle**: Create a program that reads a pickled file and displays the first 5 rows of the data.
22. **to_datetime**: Create a program that takes a pandas Series and converts the values to datetime objects.
23. **to_numeric**: Create a program that takes a pandas Series and converts the values to numerical data.
24. **unique**: Create a program that takes a pandas Series and returns the unique values in the data.
25. **value_counts**: Create a program that takes a pandas Series and returns the frequency counts of each unique value in the data.


In [2]:
# @title Q 1: array
# Create a Pandas DataFrame with a single column containing numbers 1-100
df = pd.DataFrame({'numbers': range(1,101)})

# Convert the DataFrame column to a NumPy array with to_numpy()
# (pd.array would return a pandas extension array, not a NumPy ndarray)
np_array = df['numbers'].to_numpy()

print("Pandas DataFrame:")
print(df)

print("NumPy Array:")
print(np_array)

Pandas DataFrame:
    numbers
0         1
1         2
2         3
3         4
4         5
..      ...
95       96
96       97
97       98
98       99
99      100

[100 rows x 1 columns]
NumPy Array:
[  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
  91  92  93  94  95  96  97  98  99 100]


In [3]:
# @title Q 2: concat

# Create the first Pandas DataFrame
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Create the second Pandas DataFrame
df2 = pd.DataFrame({'A': [7, 8, 9], 'B': [10, 11, 12]})

# Use the pd.concat function to combine the two DataFrames into one DataFrame
df_concat = pd.concat([df1, df2])

print("First DataFrame:")
print(df1)

print("Second DataFrame:")
print(df2)

print("Concatenated DataFrame:")
print(df_concat)

First DataFrame:
   A  B
0  1  4
1  2  5
2  3  6
Second DataFrame:
   A   B
0  7  10
1  8  11
2  9  12
Concatenated DataFrame:
   A   B
0  1   4
1  2   5
2  3   6
0  7  10
1  8  11
2  9  12
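
The concatenated result above keeps each frame's original row labels, so the index reads 0, 1, 2, 0, 1, 2. If unique labels are wanted, one option is `ignore_index=True`. A minimal sketch, reusing the same toy frames (the name `df_reset` is only illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [7, 8, 9], 'B': [10, 11, 12]})

# ignore_index=True discards the original row labels and renumbers the rows from 0
df_reset = pd.concat([df1, df2], ignore_index=True)

print(df_reset)
```

Duplicate labels are not an error in pandas, but they make label-based lookups such as `df_concat.loc[0]` return more than one row.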


In [14]:
# @title Q 3: cut
# Simulate a dataset of 100 customer heights in inches (normal, mean 69, std 3)
df = pd.DataFrame({'heights': np.random.normal(loc=69, scale=3, size=100)})
# print(df)
# Use the pd.cut function to bin the heights into different categories.
# Five edges define four right-closed bins: (60, 65], (65, 70], (70, 75], (75, 80].
# Heights outside (60, 80] are assigned NaN.
height_bins = [60, 65, 70, 75, 80]
height_labels = ['short', 'below average', 'average', 'tall']
height_categories = pd.cut(df['heights'], bins=height_bins, labels=height_labels)
# print(height_categories)
# Add the height categories to the original heights dataset as a new column
df['category'] = height_categories

print(df.head())

     heights       category
0  67.571779  below average
1  70.026581        average
2  70.567569        average
3  71.054596        average
4  66.756706  below average
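
With fixed outer edges such as 60 and 80, `pd.cut` returns NaN for any height outside the bins. The sketch below contrasts that with open-ended outer bins built from `np.inf`. The five sample heights are made up for illustration and are not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample heights, two of them outside the 60-80 range
heights = pd.Series([58.0, 64.0, 69.5, 74.0, 82.0])
labels = ['short', 'below average', 'average', 'tall']

# Fixed outer edges: 58.0 and 82.0 fall outside every bin and become NaN
bounded = pd.cut(heights, bins=[60, 65, 70, 75, 80], labels=labels)

# Open-ended outer edges: every value lands in some bin
open_ended = pd.cut(heights, bins=[-np.inf, 65, 70, 75, np.inf], labels=labels)

print(pd.DataFrame({'heights': heights, 'bounded': bounded, 'open_ended': open_ended}))
```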


In [5]:
# @title Q 4: date_range

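# Use the date_range function to generate 30 consecutive daily timestamps starting now.
# Note: periods=30 yields 30 entries, so the last one is 29 days after today,
# and each entry keeps the current time of day.
today = pd.Timestamp.today()
dates = pd.date_range(start=today, periods=30)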

# Create a Pandas DataFrame with a single column containing the dates
dates_df = pd.DataFrame({'date': dates})

print(dates_df.head())

                        date
0 2023-02-20 12:00:44.921931
1 2023-02-21 12:00:44.921931
2 2023-02-22 12:00:44.921931
3 2023-02-23 12:00:44.921931
4 2023-02-24 12:00:44.921931
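
The timestamps above carry the time of day at which the cell ran. If the goal is calendar dates from today through the day 30 days from now, one possible variant normalizes the start to midnight and passes an explicit `end`. This is a sketch under that reading of the question, not the only valid one:

```python
import pandas as pd

# Midnight at the start of the current day
start = pd.Timestamp.today().normalize()
# The day exactly 30 days later; date_range includes both endpoints
end = start + pd.Timedelta(days=30)

dates = pd.date_range(start=start, end=end, freq='D')
dates_df = pd.DataFrame({'date': dates})

print(len(dates_df))
print(dates_df.head())
```

Because both endpoints are included, this produces 31 rows rather than 30.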


In [16]:
# @title Q 5: eval
# Create a toy dataset with 5 rows and 3 columns (Date, Open, and Close)
dates = pd.date_range(start='2023-01-01', periods=5)
opens = np.random.randint(low=100, high=200, size=5)
closes = opens + np.random.randint(low=1, high=10, size=5)
data = {'Date': dates, 'Open': opens, 'Close': closes}
df = pd.DataFrame(data)
# Use the eval method to add a column C = Open + Close.
# This is equivalent to: df['C'] = df['Open'] + df['Close']
df['C'] = df.eval("Open + Close")
# print(df)
# The daily return (percentage change from Open to Close) can be computed the same way:
# df['returns'] = df.eval("(Close - Open) / Open * 100")

print(df.head())

        Date  Open  Close    C
0 2023-01-01   120    129  249
1 2023-01-02   154    158  312
2 2023-01-03   176    181  357
3 2023-01-04   174    179  353
4 2023-01-05   148    150  298
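
The cell above only sums the two price columns. The returns asked for in the question can be sketched with the same `eval` call. The prices below are copied from the printed output so the result is reproducible. Treating "return" as the open-to-close percentage change is an assumption taken from the commented-out line in the cell.

```python
import pandas as pd

# Prices copied from the printed output above
df = pd.DataFrame({
    'Date': pd.date_range(start='2023-01-01', periods=5),
    'Open': [120, 154, 176, 174, 148],
    'Close': [129, 158, 181, 179, 150],
})

# Open-to-close return in percent, evaluated as a string expression
df['returns'] = df.eval("(Close - Open) / Open * 100")

# Day-over-day change of the closing price, for comparison (first row is NaN)
df['close_pct_change'] = df['Close'].pct_change() * 100

print(df)
```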


In [7]:
# @title Q 6: get_dummies
# Create a dataset of customer genders
genders = ['Male', 'Female', 'Male', 'Female', 'Other']
customers = pd.DataFrame({'Gender': genders})

# One-hot encode the genders
one_hot_encoded = pd.get_dummies(customers['Gender'])

# Join the one-hot encoded genders back to the original data
customers = pd.concat([customers, one_hot_encoded], axis=1)

# Print the resulting DataFrame
print(customers)

   Gender  Female  Male  Other
0    Male       0     1      0
1  Female       1     0      0
2    Male       0     1      0
3  Female       1     0      0
4   Other       0     0      1
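
The encode-then-concat steps above can also be done in one call by passing the DataFrame itself with `columns=`, which replaces the original column with its indicator columns. In pandas 2.0 and later the indicators default to booleans, so `dtype=int` is passed here to get 0/1 values like those printed above. A minimal sketch:

```python
import pandas as pd

customers = pd.DataFrame({'Gender': ['Male', 'Female', 'Male', 'Female', 'Other']})

# Encode the Gender column in place; new columns are named Gender_<value>
encoded = pd.get_dummies(customers, columns=['Gender'], prefix='Gender', dtype=int)

print(encoded)
```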


In [8]:
# @title Q 7: infer_freq
# Create a toy dataset of daily temperatures
dates = pd.date_range('2021-01-01', periods=365)
temperatures = np.random.randint(low=0, high=100, size=365)
df = pd.DataFrame({'date': dates, 'temperature': temperatures})
print(df)
# Infer the frequency of the data
freq = pd.infer_freq(df['date'])
print(freq)

          date  temperature
0   2021-01-01           25
1   2021-01-02           41
2   2021-01-03           10
3   2021-01-04           89
4   2021-01-05           23
..         ...          ...
360 2021-12-27           37
361 2021-12-28           17
362 2021-12-29            3
363 2021-12-30           66
364 2021-12-31           23

[365 rows x 2 columns]
D
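
`D` is the alias for calendar-day frequency. The sketch below shows two other outcomes: a weekly index, and an irregular index for which no frequency can be inferred. Both sets of dates are illustrative and not taken from the temperature data.

```python
import pandas as pd

# Ten consecutive Sundays; 'W' is the weekly alias anchored on Sunday
weekly = pd.date_range('2021-01-03', periods=10, freq='W')
weekly_freq = pd.infer_freq(weekly)

# Irregular gaps (1, 3 and 4 days): no single frequency fits
irregular = pd.DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-05', '2021-01-09'])
irregular_freq = pd.infer_freq(irregular)

print(weekly_freq)
print(irregular_freq)
```

`infer_freq` needs at least three timestamps and returns `None` when the spacing is not regular.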


In [9]:
# @title Q 8: interval_range
# Define the start, end, and step for the intervals
start = 0
end = 10
step = 2

# Generate the intervals using the interval_range function
intervals = pd.interval_range(start=start, end=end, freq=step)

# Create a single-column DataFrame to store the intervals
df = pd.DataFrame({'Interval': intervals})

# Display the resulting DataFrame
print(df)

  Interval
0   (0, 2]
1   (2, 4]
2   (4, 6]
3   (6, 8]
4  (8, 10]
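
The intervals above are closed on the right, which is the default. A short sketch of two variations: choosing the closed side with `closed=`, and giving the number of intervals with `periods=` instead of a step.

```python
import pandas as pd

# Left-closed intervals: [0, 2), [2, 4), ...
left_closed = pd.interval_range(start=0, end=10, freq=2, closed='left')

# Five equal-width intervals between 0 and 10, without specifying the step
by_count = pd.interval_range(start=0, end=10, periods=5)

print(pd.DataFrame({'left_closed': left_closed, 'by_count': by_count}))
```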


In [10]:
# @title Q 16: qcut
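# Toy dataset: nine movies with integer ratings
df = pd.DataFrame({'movie_id': range(9), 'rating': [8,1,4,6,3,8,9,3,10]})
print(df)
# Bin the ratings into quartiles; duplicates='drop' drops non-unique bin edges
# instead of raising a ValueError
df['rating_new'] = pd.qcut(df['rating'], 4, duplicates='drop')
print(df)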

   movie_id  rating
0         0       8
1         1       1
2         2       4
3         3       6
4         4       3
5         5       8
6         6       9
7         7       3
8         8      10
   movie_id  rating    rating_new
0         0       8    (6.0, 8.0]
1         1       1  (0.999, 3.0]
2         2       4    (3.0, 6.0]
3         3       6    (3.0, 6.0]
4         4       3  (0.999, 3.0]
5         5       8    (6.0, 8.0]
6         6       9   (8.0, 10.0]
7         7       3  (0.999, 3.0]
8         8      10   (8.0, 10.0]
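
The interval labels above can be replaced with readable names through `labels=`, and `retbins=True` returns the computed bin edges as well. The label names in this sketch are made up for illustration.

```python
import pandas as pd

rating = pd.Series([8, 1, 4, 6, 3, 8, 9, 3, 10])

# Hypothetical names for the four quartile bins, lowest to highest
labels = ['low', 'mid-low', 'mid-high', 'high']

# retbins=True returns the categorical result and the bin edges
quartile, edges = pd.qcut(rating, q=4, labels=labels, retbins=True)

print(quartile)
print(edges)
```

When `labels` is given, its length must match the number of bins, so combining it with `duplicates='drop'` can fail if any edges are dropped.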


In [11]:
print(pd.qcut.__doc__)


    Quantile-based discretization function.

    Discretize variable into equal-sized buckets based on rank or based
    on sample quantiles. For example 1000 values for 10 quantiles would
    produce a Categorical object indicating quantile membership for each data point.

    Parameters
    ----------
    x : 1d ndarray or Series
    q : int or list-like of float
        Number of quantiles. 10 for deciles, 4 for quartiles, etc. Alternately
        array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles.
    labels : array or False, default None
        Used as labels for the resulting bins. Must be of the same length as
        the resulting bins. If False, return only integer indicators of the
        bins. If True, raises an error.
    retbins : bool, optional
        Whether to return the (bins, labels) or not. Can be useful if bins
        is given as a scalar.
    precision : int, optional
        The precision at which to store and display the bins labels.
    duplicates