### CONTEXT AND RELEVANCE

#### Questions this section might answer:
- Are there any potential biases in the data?
- If there are biases, what are the risks and implications?


In [23]:
# import library
import pandas as pd

# Relative path to read the CSV file from the "data" folder
df = pd.read_csv(r"./data/bahay-ugnayan/inventories.csv")

# Soft-convert object columns to a more specific dtype where pandas can infer one
df = df.infer_objects()

---

### DATA QUALITY

### Check for missing values

In [24]:
# Check for missing values
null_counts = df.isnull().sum()

print(f"{null_counts}\n-----------\n")

for column, count in null_counts.items():
    if count > 0:
        print(f"{column}: {count} null values")

_id                                   0
accessionId                           0
amoNumber                           776
assetCategory                         0
assetDescription                      0
                                   ... 
displayedAssets[0].createdBy        773
displayedAssets[0].assetLocation    773
displayedAssets[0].roomNumber       773
displayedAssets[0].createdAt        773
displayedAssets[0]._id              773
Length: 90, dtype: int64
-----------

amoNumber: 776 null values
acquisitionDate: 701 null values
acquisitionPrice: 776 null values
currentMarketValue: 776 null values
images[0].name: 177 null values
images[1].name: 183 null values
images[2].name: 183 null values
images[3].name: 497 null values
images[4].name: 643 null values
images[5].name: 714 null values
images[6].name: 746 null values
images[7].name: 755 null values
images[8].name: 767 null values
images[9].name: 772 null values
images[10].name: 773 null values
images[0].s3Key: 177 null values
images
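Raw null counts are easier to judge as a share of the rows. For example, `amoNumber` has 776 nulls in a 776-row frame, so the column is entirely empty. The following is a minimal sketch of such a summary; `missing_summary` is a hypothetical helper name, and the toy frame uses made-up values rather than the real inventory data.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of nulls per column, largest first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(frame) * 100).round(1),
    })
    # Keep only the columns that actually have missing values
    return summary[summary["null_count"] > 0].sort_values(
        "null_count", ascending=False
    )

# Toy frame standing in for the inventory data (values are made up)
toy = pd.DataFrame({
    "accessionId": ["A1", "A2", "A3", "A4"],
    "amoNumber": [None, None, None, None],
    "acquisitionDate": ["2020-01-01", None, None, None],
})
print(missing_summary(toy))
```

On the real data this would be called as `missing_summary(df)`.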

### Check for duplicates

In [25]:
print("There are", f"\033[1m{df.duplicated().sum()}\033[0m", "duplicates found")

There are 0 duplicates found
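`df.duplicated()` compares entire rows, so two records that differ only in a unique field such as `_id` are not counted as duplicates. A hedged sketch of a key-based check follows. It assumes that `accessionId` is meant to be unique per asset, which the data above does not confirm, and `duplicates_on` is a hypothetical helper name.

```python
import pandas as pd

def duplicates_on(frame: pd.DataFrame, keys: list[str]) -> pd.DataFrame:
    """Return every row whose values in `keys` occur more than once."""
    # keep=False marks all members of a duplicate group, not just the repeats
    mask = frame.duplicated(subset=keys, keep=False)
    return frame[mask].sort_values(keys)

# Toy frame (made-up values): rows differ in _id but share an accessionId
toy = pd.DataFrame({
    "_id": ["x1", "x2", "x3"],
    "accessionId": ["ACC-1", "ACC-1", "ACC-2"],
})

print("Full-row duplicates:", toy.duplicated().sum())
print("Rows sharing an accessionId:", len(duplicates_on(toy, ["accessionId"])))
```

On the real data, `duplicates_on(df, ["accessionId"])` would list any rows that share an accession ID.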


---

### DATA STRUCTURE AND TYPES

In [26]:
rows, columns = df.shape
print(f"Number of rows: {rows:,}")
print(f"Number of columns: {columns:,}")

Number of rows: 776
Number of columns: 90


In [27]:
# Print the data types of the columns
print(df.dtypes)

_id                                  object
accessionId                          object
amoNumber                           float64
assetCategory                        object
assetDescription                     object
                                     ...   
displayedAssets[0].createdBy         object
displayedAssets[0].assetLocation     object
displayedAssets[0].roomNumber        object
displayedAssets[0].createdAt         object
displayedAssets[0]._id               object
Length: 90, dtype: object


In [28]:
print("Count of columns in each type:\n")
df.dtypes.value_counts()

Count of columns in each type:



object     82
float64     6
int64       2
Name: count, dtype: int64
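Most columns remain `object` because `infer_objects()` only performs soft conversions and does not parse strings. Date-like columns can be converted explicitly. The sketch below assumes that `acquisitionDate` holds date strings, which the output above does not confirm, and uses made-up values.

```python
import pandas as pd

# Toy column standing in for df["acquisitionDate"] (values are made up)
toy = pd.DataFrame({"acquisitionDate": ["2021-03-15", "not a date", None]})

# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(toy["acquisitionDate"], errors="coerce")

print(parsed.dtype)
print("Unparseable or missing:", parsed.isna().sum())
```

Comparing the null count before and after parsing shows how many entries failed to parse, as opposed to being missing in the first place.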

---

### OUTLIERS

In [29]:
def find_extremes(df, column):
    min_value = df[column].min()
    max_value = df[column].max()
    return min_value, max_value

In [30]:
min_val, max_val = find_extremes(df, 'assetQuantity')
print(f"Minimum: {min_val}, Maximum: {max_val}")

Minimum: 1, Maximum: 347


In [31]:
# The 5 smallest values
smallest_values = df['assetQuantity'].nsmallest(5)
print("Smallest values:\n", smallest_values)

# The 5 largest values
largest_values = df['assetQuantity'].nlargest(5)
print("Largest values:\n", largest_values)

Smallest values:
 6     1
7     1
8     1
9     1
10    1
Name: assetQuantity, dtype: int64
Largest values:
 235    347
642     61
5       34
0       23
671     10
Name: assetQuantity, dtype: int64
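Minimum and maximum values show the range but do not say which values count as outliers. One common convention, not necessarily the one intended here, is the 1.5 × IQR rule. A minimal sketch on made-up data follows; `iqr_outliers` is a hypothetical helper name.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Toy series (made-up values) with one obviously large entry
toy = pd.Series([1, 1, 2, 2, 3, 3, 4, 50], name="assetQuantity")
print(iqr_outliers(toy))
```

When most values are identical, the IQR can be 0, and the rule then flags every value that differs from the quartiles. In that case a percentile cutoff or a manual review of the largest values may be more useful.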


---

### DATA DISTRIBUTIONS AND SUMMARY STATISTICS

In [32]:
import pandas as pd

def calculate_statistics(df, column):
    """
    Returns the mean, median, and mode for a specified column in the DataFrame.

    Parameters:
        df (pd.DataFrame): The DataFrame containing the data.
        column (str): The column name to calculate statistics for.

    Returns:
        dict: A dictionary containing the mean, median, and mode of the column.
    """
    mean_value = df[column].mean()
    median_value = df[column].median()
    # mode() returns all tied modes in sorted order; keep the smallest one
    modes = df[column].mode()
    mode_value = modes.iloc[0] if not modes.empty else None

    return {
        'mean': mean_value,
        'median': median_value,
        'mode': mode_value
    }

In [33]:
stats = calculate_statistics(df, 'assetQuantity')

# Formatting the output
formatted_stats = (
    f"Statistics for 'assetQuantity' column:\n"
    f"Mean: {stats['mean']}\n"
    f"Median: {stats['median']}\n"
    f"Mode: {stats['mode']}"
)

print(formatted_stats)

Statistics for 'assetQuantity' column:
Mean: 1.7384020618556701
Median: 1.0
Mode: 1


#### In simpler terms:

1. **Average Value**: The average assetQuantity is about 1.74.
2. **Middle Value**: The middle assetQuantity (when all values are sorted) is 1.
3. **Most Common Value**: The most frequently occurring assetQuantity is 1.

#### Interpretation:
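- **Right-Skewed**: The mean (1.74) is higher than both the median (1.0) and the mode (1), so the distribution is skewed to the right rather than balanced around a central value.
- **Driven by a Few Large Values**: At least half of the records have an assetQuantity of 1. A handful of much larger values (347, 61, 34, and 23 in the largest-values output above) pull the mean upward.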
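One way to check how symmetric the distribution is, and how much a single record influences the mean, is to compare the mean with the median, compute the sample skewness, and recompute the mean without the largest value. The sketch below uses made-up values, and `skew_report` is a hypothetical helper name.

```python
import pandas as pd

def skew_report(series: pd.Series) -> dict:
    """Summarise how far the mean is pulled away from the median."""
    return {
        "mean": series.mean(),
        "median": series.median(),
        # Positive skewness indicates a long tail toward large values
        "skewness": series.skew(),
        # Mean after dropping the single largest observation
        "mean_without_max": series.drop(series.idxmax()).mean(),
    }

# Toy series (made-up values): mostly 1s with one large entry
toy = pd.Series([1, 1, 1, 1, 2, 3, 50], name="assetQuantity")
report = skew_report(toy)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

On the real data, `skew_report(df["assetQuantity"])` would show how much the record with the maximum quantity shifts the mean.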