#### 1. Creating, Reading & Writing to a File

##### Importing Libraries

In [2]:
import pandas as pd
import numpy as np

##### Data Loading

In [3]:
df = pd.read_csv('../data/NYC_Rolling_Sales_Dataset/nyc-rolling-sales.csv')
df.head()

,Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING CLASS CATEGORY,TAX CLASS AT PRESENT,BLOCK,LOT,EASE-MENT,BUILDING CLASS AT PRESENT,ADDRESS,...,RESIDENTIAL UNITS,COMMERCIAL UNITS,TOTAL UNITS,LAND SQUARE FEET,GROSS SQUARE FEET,YEAR BUILT,TAX CLASS AT TIME OF SALE,BUILDING CLASS AT TIME OF SALE,SALE PRICE,SALE DATE
0,4,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2A,392,6,,C2,153 AVENUE B,...,5,0,5,1633,6440,1900,2,C2,6625000,2017-07-19 00:00:00
1,5,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2,399,26,,C7,234 EAST 4TH STREET,...,28,3,31,4616,18690,1900,2,C7,-,2016-12-14 00:00:00
2,6,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2,399,39,,C7,197 EAST 3RD STREET,...,16,1,17,2212,7803,1900,2,C7,-,2016-12-09 00:00:00
3,7,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2B,402,21,,C4,154 EAST 7TH STREET,...,10,0,10,2272,6794,1913,2,C4,3936272,2016-09-23 00:00:00
4,8,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2A,404,55,,C2,301 EAST 10TH STREET,...,6,0,6,2369,4615,1900,2,C2,8000000,2016-11-17 00:00:00
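
An `Unnamed: 0` column like the one in this preview typically appears when a CSV was saved together with its index. The sketch below reproduces that effect and shows one way to drop the column. It uses a small in-memory CSV as a stand-in for the real file; the sample rows and the names `raw`, `with_index_col`, and `cleaned` are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd

# Stand-in for a CSV exported with its index: the first header cell is empty.
raw = io.StringIO(
    ",BOROUGH,NEIGHBORHOOD\n"
    "4,1,ALPHABET CITY\n"
    "5,1,ALPHABET CITY\n"
)

# pandas labels a blank header cell "Unnamed: <position>".
with_index_col = pd.read_csv(raw)
print(with_index_col.columns.tolist())

# Drop the leftover column if it carries no information for the analysis.
cleaned = with_index_col.drop(columns=["Unnamed: 0"])
print(cleaned.columns.tolist())
```

Whether the column is safe to drop in the real dataset depends on whether anything downstream relies on it, so treat this as an option rather than a required step.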


##### Total No. of Rows & Columns

In [4]:
df.shape 

(84548, 22)

##### Dataset Features

In [5]:
df.columns 

Index(['Unnamed: 0', 'BOROUGH', 'NEIGHBORHOOD', 'BUILDING CLASS CATEGORY',
       'TAX CLASS AT PRESENT', 'BLOCK', 'LOT', 'EASE-MENT',
       'BUILDING CLASS AT PRESENT', 'ADDRESS', 'APARTMENT NUMBER', 'ZIP CODE',
       'RESIDENTIAL UNITS', 'COMMERCIAL UNITS', 'TOTAL UNITS',
       'LAND SQUARE FEET', 'GROSS SQUARE FEET', 'YEAR BUILT',
       'TAX CLASS AT TIME OF SALE', 'BUILDING CLASS AT TIME OF SALE',
       'SALE PRICE', 'SALE DATE'],
      dtype='object')

##### Information about Dataset

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84548 entries, 0 to 84547
Data columns (total 22 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   Unnamed: 0                      84548 non-null  int64 
 1   BOROUGH                         84548 non-null  int64 
 2   NEIGHBORHOOD                    84548 non-null  object
 3   BUILDING CLASS CATEGORY         84548 non-null  object
 4   TAX CLASS AT PRESENT            84548 non-null  object
 5   BLOCK                           84548 non-null  int64 
 6   LOT                             84548 non-null  int64 
 7   EASE-MENT                       84548 non-null  object
 8   BUILDING CLASS AT PRESENT       84548 non-null  object
 9   ADDRESS                         84548 non-null  object
 10  APARTMENT NUMBER                84548 non-null  object
 11  ZIP CODE                        84548 non-null  int64 
 12  RESIDENTIAL UNITS               84548 non-null

##### Short Summary 

In [7]:
summary = pd.DataFrame({
    'Total Rows': [df.shape[0]],
    'Total Columns': [df.shape[1]],
    'Missing Values': [df.isnull().sum().sum()]
})

summary

,Total Rows,Total Columns,Missing Values
0,84548,22,0
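
A missing-value count of 0 only means that no cell holds `NaN` or `None`; text placeholders are not counted by `isnull()`. The sketch below illustrates the difference on a toy frame. The toy data and the choice of `""` and `"-"` as placeholder markers are assumptions made for illustration.

```python
import pandas as pd

# Toy frame: no real nulls, but one price is recorded as a dash placeholder.
toy = pd.DataFrame({
    "BLOCK": [392, 399, 402],
    "SALE PRICE": ["6625000", " -  ", "3936272"],
})

# isnull() sees only NaN/None, so the placeholder is not counted.
null_total = int(toy.isnull().sum().sum())

# Count cells that are empty or a lone dash after trimming whitespace.
placeholder_counts = toy.astype(str).apply(
    lambda col: col.str.strip().isin(["", "-"]).sum()
)

print(null_total)
print(placeholder_counts)
```

Comparing the two counts per column gives a quick hint about which fields may need cleaning before analysis.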


##### Column-level summary

In [8]:
column_summary = pd.DataFrame({
    'Column Name': df.columns,
    'Data Type': df.dtypes.values,
    'Non-Null Count': df.count().values,
    'Null Count': df.isnull().sum().values,
    'Unique Values': [df[col].nunique() for col in df.columns]
})

print("Detailed Column Summary:")
print(column_summary.to_string(index=False))
print("\n" + "="*50 + "\n")

Detailed Column Summary:
                   Column Name Data Type  Non-Null Count  Null Count  Unique Values
                    Unnamed: 0     int64           84548           0          26736
                       BOROUGH     int64           84548           0              5
                  NEIGHBORHOOD    object           84548           0            254
       BUILDING CLASS CATEGORY    object           84548           0             47
          TAX CLASS AT PRESENT    object           84548           0             11
                         BLOCK     int64           84548           0          11566
                           LOT     int64           84548           0           2627
                     EASE-MENT    object           84548           0              1
     BUILDING CLASS AT PRESENT    object           84548           0            167
                       ADDRESS    object           84548           0          67563
              APARTMENT NUMBER    object           

##### Select Specific Column

In [9]:
df['SALE PRICE'].head(10)

0     6625000
1         -  
2         -  
3     3936272
4     8000000
5         -  
6     3192840
7         -  
8         -  
9    16232000
Name: SALE PRICE, dtype: object
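
The `object` dtype here reflects that prices are stored as text, with `-` where no price is recorded. One possible way to obtain a numeric column is `pd.to_numeric` with `errors="coerce"`, sketched below on a few toy values modelled on the preview (the variable names are illustrative).

```python
import pandas as pd

# Toy values mimicking the preview: digits stored as text, "-" where no price is recorded.
prices = pd.Series(["6625000", " -  ", " -  ", "3936272"], name="SALE PRICE")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
numeric_prices = pd.to_numeric(prices.str.strip(), errors="coerce")

print(numeric_prices.dtype)
print(int(numeric_prices.isna().sum()))
```

After the conversion, the dash placeholders show up as `NaN`, so `isnull()` and numeric aggregations treat them as missing.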

##### Exporting the Summary

In [10]:
summary.to_csv('../outputs/dataset_summary.csv', index=False)
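
To illustrate why `index=False` matters, the sketch below writes a small summary frame to a temporary directory with and without the index and reads both files back. The frame `summary_demo` and the file names are stand-ins chosen for the example.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the dataset-level summary built earlier.
summary_demo = pd.DataFrame(
    {"Total Rows": [84548], "Total Columns": [22], "Missing Values": [0]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    clean_path = os.path.join(tmp_dir, "summary_no_index.csv")
    indexed_path = os.path.join(tmp_dir, "summary_with_index.csv")

    # index=False writes only the data columns.
    summary_demo.to_csv(clean_path, index=False)
    # The default (index=True) also writes the row labels as an unnamed first column.
    summary_demo.to_csv(indexed_path)

    reloaded = pd.read_csv(clean_path)
    reloaded_with_index = pd.read_csv(indexed_path)

print(reloaded.columns.tolist())
print(reloaded_with_index.columns.tolist())
```

Note that `to_csv` does not create missing folders, so a target such as `../outputs` has to exist before the export, for example via `os.makedirs(..., exist_ok=True)`.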

### 🐼 Module 1: Creating, Reading & Writing with Pandas

#### 📌 Objective
Learn how to load, explore, summarize, and export data using **Pandas** through hands-on practice with a real-world dataset.

---

#### 📂 Dataset
**NYC Rolling Sales Dataset**: property sale transactions across New York City. Key features include:
* **Borough & Neighborhood**
* **Building Class**
* **Sale Price**
* **Sale Date**

---

#### 🔧 Work Done
In this module, I performed the following data operations:

1.  **Environment Setup**: Imported essential libraries (`Pandas`, `NumPy`).
2.  **Data Ingestion**: Loaded the CSV dataset using `read_csv()`.
3.  **Initial Exploration**: 
    * Previewed data using `head()`.
    * Checked dataset dimensions with `shape`.
    * Identified all features using `columns`.
4.  **Deep Inspection**: Analyzed data types and missing values using `info()`.
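5.  **Summary Generation**:
    * **Dataset-level**: Captured total rows, total columns, and the overall missing-value count.
    * **Column-level**: Tabulated each column's data type, non-null count, null count, and number of unique values.
6.  **Column Selection**: Previewed the `SALE PRICE` column using `head(10)`.
7.  **Data Export**: Saved the dataset-level summary to a new CSV file using `to_csv()`.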

---

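#### 📊 Key Outcomes
* **Structural Insight**: Confirmed that the dataset has 84,548 rows and 22 columns, including an `Unnamed: 0` column that looks like a leftover index from an earlier export.
* **Audit Readiness**: `isnull()` reports 0 missing values, yet `SALE PRICE` is stored as `object` and contains `-` placeholders, so missing prices are hidden rather than absent.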
* **Reproducibility**: Created reusable summary tables that can be applied to future datasets for rapid reporting.
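
As a rough sketch of how the column-level summary could be reused on other datasets, the logic can be wrapped in a small function. The name `summarize_columns` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, non-null count, null count, unique values."""
    return pd.DataFrame({
        "Column Name": frame.columns,
        "Data Type": frame.dtypes.astype(str).values,
        "Non-Null Count": frame.count().values,
        "Null Count": frame.isnull().sum().values,
        "Unique Values": [frame[col].nunique() for col in frame.columns],
    })


# Toy frame standing in for any dataset to be audited.
toy = pd.DataFrame({"BOROUGH": [1, 1, 2], "SALE PRICE": ["6625000", None, "-"]})
report = summarize_columns(toy)
print(report.to_string(index=False))
```

Calling the function on a freshly loaded frame yields the same five-column report shape regardless of the dataset, which keeps the audit step consistent between projects.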

---

#### 🛠️ Tools Used
* **Language**: Python
* **Libraries**: Pandas, NumPy
* **Environment**: Jupyter Notebook (VS Code)

---