# Data Overview: Summarize and Preview Your Dataset Like a Pro

When working with data, the first step isn’t analysis or modeling; it’s understanding. Taking time to perform a comprehensive data overview and summary not only highlights potential challenges but also sets the stage for impactful insights. A quick preview can reveal patterns, inconsistencies, and opportunities, helping you approach your dataset with clarity. This post covers the essentials of data overviews, summaries, and previews, providing actionable steps and Python tools to make the process seamless.

The final cell of the notebook is a short Python script that automates Git operations using `subprocess` to keep a local repository synchronized with a remote GitHub repository. It relies on two standard-library modules:

- `subprocess`: allows the script to run shell commands as subprocesses.
- `os`: used here to check the current working directory.


## Why Data Overview Matters

Starting data analysis without an overview is like setting sail without checking the weather forecast: you might make progress, but you're unprepared for the challenges ahead. By summarizing and previewing your dataset, you can:

- Understand its structure and size.
- Identify missing, duplicate, or problematic data.
- Gain insights into variable types and distributions.

A proper data overview tells you early whether your data is clean and well organized, saving time and avoiding pitfalls later in the process.
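As a minimal sketch of the three checks listed above, plain pandas calls are enough. The toy DataFrame below is made up for illustration and is not the dataset used later in this post:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
df = pd.DataFrame({
    "city": ["Leeds", "York", "York", None],
    "rooms": [2.0, 3.0, 3.0, np.nan],
})

n_rows, n_cols = df.shape                      # structure and size
missing_per_column = df.isna().sum()           # missing values per column
n_duplicate_rows = int(df.duplicated().sum())  # fully duplicated rows
column_types = df.dtypes                       # variable types

print(n_rows, n_cols)
print(missing_per_column)
print(n_duplicate_rows)
print(column_types)
print(df.describe())                           # distribution of numeric columns
```

The custom functions later in the post bundle the same kind of calls into reusable summary tables.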



## Automating Overviews with Python
Manually summarizing large datasets can be tedious, but Python makes it easy. In this post, we’ll explore how to streamline the process using custom functions.

In [31]:
import numpy as np
import pandas as pd


In [32]:
train_data = pd.read_csv("rightmove_prop.csv")


In [33]:
train_data

| | Property Address | Agent Address | Agent Name | Available Date | Property Type | Bedrooms | Bathrooms | Post Date | Price | Latitude | Longitude | URL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kampus, Apt 1209 South, 59 Chorlton Street, Ma... | Aytoun Street, Manchester, M1 3GL | Native Communities, Manchester | Let available date: Now | | 1.0 | 1.0 | Added on 13/02/2025 | £1,525 pcm\n£352 pw | 53.476880 | -2.234550 | https://www.rightmove.co.uk/properties/1581954... |
| 1 | High Street, Manchester, M4 | 20 Wenlock Road, London, N1 7GU | OpenRent, London | Let available date: Now | Flat | 1.0 | 1.0 | Reduced today | £975 pcm\n£225 pw | 53.485134 | -2.236448 | https://www.rightmove.co.uk/properties/1580019... |
| 2 | Watson Street, Manchester, M3 | 20 Wenlock Road, London, N1 7GU | OpenRent, London | Let available date: 22/02/2025 | Flat | 1.0 | 1.0 | Reduced today | £1,100 pcm\n£254 pw | 53.477320 | -2.248315 | https://www.rightmove.co.uk/properties/1573810... |
| 3 | Ancoats Gardens, 14 Rochdale Rd, M4 | One St Peter's Square Manchester M2 3DE | Vesper Homes, Manchester | Let available date: Now | Apartment | 2.0 | 1.0 | Reduced today | £1,300 pcm\n£300 pw | 53.487509 | -2.233231 | https://www.rightmove.co.uk/properties/1548482... |
| 4 | Chevington Drive, Heaton Mersey, Stockport, SK4 | 14 Moorside Road, Heaton Moor, Stockport, SK4 4DT | Julian Wadden, Heaton Moor | Let available date: 17/03/2025 | Semi-Detached | 3.0 | 1.0 | Reduced today | £1,450 pcm\n£335 pw | 53.416725 | -2.209836 | https://www.rightmove.co.uk/properties/1574637... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 774 | Wilmott Street, Manchester, Greater Manchester... | 50 Bridge Street, Manchester, M3 3BW | Savills Lettings, Manchester | Let available date: Now | Apartment | | 1.0 | Added on 12/02/2025 | £1,350 pcm\n£312 pw | 53.471124 | -2.245620 | https://www.rightmove.co.uk/properties/8707828... |
| 775 | Armitage Street, Manchester | Sentinel House, Albert Street, Eccles, Manches... | Hills, Eccles | Let available date: Now | Terraced | 2.0 | 1.0 | Added on 12/02/2025 | £1,050 pcm\n£242 pw | 53.480202 | -2.355132 | https://www.rightmove.co.uk/properties/1566649... |
| 776 | Granby House, Granby Row, Manchester, M1 | 289 - 291 Deansgate, Manchester, M3 4EW | Leaders Lettings, Manchester | Let available date: 14/04/2025 | Apartment | 1.0 | 1.0 | Added on 12/02/2025 | £995 pcm\n£230 pw | 53.474990 | -2.235900 | https://www.rightmove.co.uk/properties/1581476... |
| 777 | Parrs Wood Road, Fallowfield, Manchester, M20 4RQ | Townhouse, 117 Ducie House, Ducie Street, Manc... | Townhouse, Manchester | Let available date: 01/07/2025 | Semi-Detached | 6.0 | 2.0 | Reduced on 12/02/2025 | £4,030 pcm\n£930 pw | 53.428707 | -2.216506 | https://www.rightmove.co.uk/properties/1541239... |
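Displaying the whole frame, as above, truncates the middle rows. A minimal sketch of more targeted previews follows, using a made-up frame rather than `train_data` (the column names `id` and `value` are purely illustrative):

```python
import pandas as pd

# Hypothetical stand-in for a larger dataset
df = pd.DataFrame({"id": range(100), "value": [i * 2 for i in range(100)]})

first_rows = df.head(5)                     # first five rows
last_rows = df.tail(5)                      # last five rows
random_rows = df.sample(5, random_state=0)  # reproducible random sample

print(first_rows)
print(last_rows)
print(random_rows)
df.info()                                   # column names, dtypes, non-null counts
```

A random sample is often more informative than the head of the file, since the first rows of a scraped dataset may not be representative.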


### Save these functions into a Python file called `functions.py`

In [34]:
import pandas as pd


def data_overview(data, title):
    """Return a one-column summary table of dataset-level statistics."""
    n_rows = len(data)
    rows_with_missing = data.isnull().any(axis=1).sum()  # rows with at least one missing value
    n_duplicates = data.duplicated().sum()                # fully duplicated rows
    is_object = data.dtypes == 'object'
    n_unique = data.nunique()

    overview_analysis = {
        title: [
            data.shape[1],
            data.shape[0],
            rows_with_missing,
            rows_with_missing / n_rows * 100,
            n_duplicates,
            n_duplicates / n_rows * 100,
            (is_object & (n_unique > 2)).sum(),  # object columns with 3 or more distinct values
            (is_object & (n_unique < 3)).sum(),  # object columns with at most 2 distinct values
            data.select_dtypes(include=['int64', 'float64']).shape[1],
        ]
    }
    index = ['Columns', 'Rows', 'Missing_Values', 'Missing_Values %',
             'Duplicates', 'Duplicates %', 'Categorical_variables',
             'Boolean_variables', 'Numerical_variables']
    return pd.DataFrame(overview_analysis, index=index).round(2)


def has_digits(value):
    """Return True if the string form of `value` contains at least one digit."""
    # Assumed implementation: this helper is called below, but its body is not shown in the post
    return any(char.isdigit() for char in str(value))


def variables_overview(data):
    """Return per-column details: unique count, dtype, and missing values."""
    variable_details = pd.DataFrame({
        'unique': data.nunique(),
        'dtype': data.dtypes,
        'null': data.isna().sum(),
        'null %': data.isna().sum() / len(data) * 100,
    })

    # Flag columns where any value, ignoring spaces, contains a non-alphanumeric character
    variable_details['has_non_alphanumeric'] = data.apply(
        lambda col: any(not str(x).replace(" ", "").isalnum() for x in col)
    )

    # Flag columns where any value contains a digit
    variable_details['has_digits'] = data.apply(
        lambda col: any(has_digits(x) for x in col)
    )

    return variable_details



In [35]:
from functions import data_overview, variables_overview

In [36]:
data_overview(train_data, "Data_Overview")   

| | Data_Overview |
|---|---|
| Columns | 12.0 |
| Rows | 779.0 |
| Missing_Values | 80.0 |
| Missing_Values % | 10.27 |
| Duplicates | 19.0 |
| Duplicates % | 2.44 |
| Categorical_variables | 8.0 |
| Boolean_variables | 0.0 |
| Numerical_variables | 4.0 |
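Note that, as the function is written, `Missing_Values` counts rows containing at least one missing value, not individual missing cells, because it uses `isnull().any(axis=1)`. A small sketch on made-up data (the columns `a` and `b` are illustrative) shows how the two counts can differ:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 has two missing cells, row 2 has one
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": ["x", None, "z"],
})

rows_with_missing = int(df.isnull().any(axis=1).sum())  # rows with at least one missing value
missing_cells = int(df.isnull().sum().sum())            # total number of missing cells

print(rows_with_missing)
print(missing_cells)
```

Which count is more useful depends on the next step: the row count tells you how much data a blanket `dropna()` would remove, while per-column cell counts point to the columns that need attention.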


In [37]:
variables_overview(train_data)

| | unique | dtype | null | null % | has_non_alphanumeric | has_digits |
|---|---|---|---|---|---|---|
| Property Address | 590 | object | 3 | 0.385109 | True | True |
| Agent Address | 193 | object | 3 | 0.385109 | True | True |
| Agent Name | 192 | object | 0 | 0.0 | True | True |
| Available Date | 100 | object | 0 | 0.0 | True | True |
| Property Type | 17 | object | 1 | 0.12837 | True | True |
| Bedrooms | 10 | float64 | 33 | 4.2362 | True | True |
| Bathrooms | 7 | float64 | 46 | 5.905006 | True | True |
| Post Date | 16 | object | 0 | 0.0 | True | True |
| Price | 247 | object | 0 | 0.0 | True | True |
| Latitude | 571 | float64 | 0 | 0.0 | True | True |
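The `True` values in `has_non_alphanumeric` for numeric columns such as `Bedrooms` and `Latitude` follow from how the check works: each value is converted with `str()`, and the decimal point in a float's string form is not alphanumeric. A quick illustration, with sample values chosen for the example:

```python
# Sample values chosen for illustration: a float, two strings, and a missing value
values = [1.0, "Flat", "Semi-Detached", float("nan")]

for value in values:
    text = str(value).replace(" ", "")
    print(repr(value), "->", repr(text), "isalnum:", text.isalnum())

# True means the value passes the alphanumeric check (so it would NOT be flagged)
checks = [str(v).replace(" ", "").isalnum() for v in values]
```

So the flag is most informative for text columns. For float columns it will be `True` whenever at least one value is present, which is worth keeping in mind when reading the table.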


In [None]:


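import os
import subprocess

# Define the notebook filename and repository
notebook_filename = "Data_Overview.ipynb"
# Placeholder: the repository URL is not shown in this post, so set your own here
repo_url = "REPLACE_WITH_YOUR_REPOSITORY_URL"

# Check the current working directory before running any Git commands
print(f"Working directory: {os.getcwd()}")


def run_command(command):
    """Run a shell command and raise CalledProcessError if it fails."""
    # Assumed helper: it is called below, but its definition is not shown in the post
    subprocess.run(command, shell=True, check=True)


# Git commands to push the Jupyter Notebook
commands = [
    "git init",
    f"git add {notebook_filename}",
    'git commit -m "Added Jupyter Notebook"',
    f"git remote add origin {repo_url}",
    "git branch -M main",
    "git push -u origin main"
]

for command in commands:
    run_command(command)

print("✅ Jupyter Notebook successfully pushed to GitHub!")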
