<a href="https://colab.research.google.com/github/Alan-Turing18this-Year/Automated-EDA-Narrator-Data-Quality-Scoring-Tool/blob/main/Output.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%cd /content
!rm -rf Automated-EDA-Narrator-Data-Quality-Scoring-Tool


/content


In [2]:
!git clone https://github.com/Alan-Turing18this-Year/Automated-EDA-Narrator-Data-Quality-Scoring-Tool.git
%cd Automated-EDA-Narrator-Data-Quality-Scoring-Tool
!ls


Cloning into 'Automated-EDA-Narrator-Data-Quality-Scoring-Tool'...
remote: Enumerating objects: 199, done.

In [3]:
!pip install -r requirements.txt




In [4]:
import sys
sys.path.append('src')


In [5]:
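# Create a sample CSV with deliberate defects: a duplicated row (id 5),
# a missing age, and extreme values in age (200) and salary (999999)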
import pandas as pd

data = {
    "id": [1,2,3,4,5,5,6],
    "name": ["Alice","Bob","Charlie","Denise","Ed","Ed","Frank"],
    "age": [34,29,45,None,23,23,200],
    "gender": ["Female","Male","Male","Female","Male","Male","Male"],
    "salary": [54000,48000,120000,60000,40000,40000,999999],
    "signup_date": ["2020-01-15","2019-11-02","2018-06-21","2021-03-11",
                    "2019-12-31","2019-12-31","2020-02-02"]
}
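df = pd.DataFrame(data)

# Make sure the target directory exists before writing the CSV
import os
os.makedirs("data", exist_ok=True)
df.to_csv("data/sample.csv", index=False)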
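The sample frame above contains one missing age, one fully duplicated row, and two extreme values. Before running the tool, these defects can be confirmed with plain pandas. The following is a minimal sketch, not the tool's own logic: the helper name `iqr_outliers` and the 1.5 × IQR fence are assumptions chosen for illustration, and the frame is rebuilt inline so the block runs on its own.

```python
import pandas as pd

# Same sample data as the cell above, rebuilt so this block is self-contained
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 5, 6],
    "name": ["Alice", "Bob", "Charlie", "Denise", "Ed", "Ed", "Frank"],
    "age": [34, 29, 45, None, 23, 23, 200],
    "gender": ["Female", "Male", "Male", "Female", "Male", "Male", "Male"],
    "salary": [54000, 48000, 120000, 60000, 40000, 40000, 999999],
    "signup_date": ["2020-01-15", "2019-11-02", "2018-06-21", "2021-03-11",
                    "2019-12-31", "2019-12-31", "2020-02-02"],
})


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the non-null values outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper; the 1.5 multiplier is an assumed convention.
    """
    s = series.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]


# Count of missing cells per column
missing_per_column = df.isna().sum()

# Every row that belongs to a group of identical rows (both copies kept)
duplicate_rows = df[df.duplicated(keep=False)]

age_outliers = iqr_outliers(df["age"])
salary_outliers = iqr_outliers(df["salary"])

print("Missing per column:\n", missing_per_column)
print("Rows in duplicate groups:", len(duplicate_rows))
print("Age outliers:", age_outliers.tolist())
print("Salary outliers:", salary_outliers.tolist())
```

With this fence, only age 200 and salary 999999 are flagged; the salary of 120000 stays inside the bounds.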


In [6]:
%%writefile src/orchestrator.py
from loader import DataLoader
from preprocessor import Preprocessor
from eda_analyzer import EDAAnalyzer
from quality_scorer import QualityScorer
from narrator import Narrator
from report_builder import ReportBuilder

class DatasetPipeline:
    def __init__(self, path):
        self.path = path        # CSV file path
        self.df = None          # DataFrame
        self.pre = None
        self.eda = None
        self.scores = None
        self.narrative = None
        self.report = None

    def run(self):
        """Run every stage in order and return the report as Markdown."""
        # Load data
        self.df = DataLoader(self.path).load()

        # Preprocessing: apply trim_strings to every string (object) column
        string_cols = self.df.select_dtypes(include=['object']).columns.tolist()
        self.pre = Preprocessor(self.df).trim_strings(string_cols)
        clean_df = self.pre.get()

        # EDA
        analyzer = EDAAnalyzer(clean_df)
        eda_results = analyzer.run_all()

        # Scoring
        scorer = QualityScorer(eda_results, df_len=len(clean_df))
        scorer.overall_score()
        scores = scorer.scores

        # Narration
        narrator = Narrator(eda_results, scores)
        narrative = narrator.generate()

        # Report
        builder = ReportBuilder(narrative, eda_results, scores)
        md = builder.to_markdown()

        # Save internal state
        self.eda = eda_results
        self.scores = scores
        self.narrative = narrative
        self.report = md

        return md


Overwriting src/orchestrator.py


In [7]:
import sys
sys.path.append('src')


In [8]:
from orchestrator import DatasetPipeline


In [9]:
pipeline = DatasetPipeline("data/sample.csv")
md = pipeline.run()
print(md)


# Automated EDA Report


## Narrative Insights


- Column 'id' has mean 3.71 and std 1.80.

- Column 'age' has mean 59.00 and std 69.56.

- Column 'salary' has mean 194571.29 and std 356233.09.

- Column 'age' has 1 missing values (14.29%).

- Column 'age' has 1 detected outliers.

- Column 'salary' has 1 detected outliers.

- Overall data quality: 81.67/100 — Good.


## Quality Scores


| Metric     |   Score |
|------------|---------|
| missing    |   97.62 |
| duplicates |   71.43 |
| outliers   |   57.14 |
| balance    |   90    |
| overall    |   81.67 |
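The `missing` and `duplicates` scores in the table can be cross-checked by hand. The sketch below uses two formulas that happen to reproduce the reported figures on this sample: the share of non-missing cells, and the share of rows that are not part of a duplicate group. These formulas are assumptions inferred from the numbers, since the `QualityScorer` source is not shown here; the `outliers`, `balance`, and `overall` scores are not reproduced because their definitions cannot be inferred with the same confidence.

```python
import pandas as pd

# Same sample data as data/sample.csv, rebuilt so this block is self-contained
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 5, 6],
    "name": ["Alice", "Bob", "Charlie", "Denise", "Ed", "Ed", "Frank"],
    "age": [34, 29, 45, None, 23, 23, 200],
    "gender": ["Female", "Male", "Male", "Female", "Male", "Male", "Male"],
    "salary": [54000, 48000, 120000, 60000, 40000, 40000, 999999],
    "signup_date": ["2020-01-15", "2019-11-02", "2018-06-21", "2021-03-11",
                    "2019-12-31", "2019-12-31", "2020-02-02"],
})

n_rows, n_cols = df.shape

# Assumed formula: percentage of cells that are not missing (1 of 42 is NaN)
missing_cells = int(df.isna().sum().sum())
missing_score = 100 * (1 - missing_cells / (n_rows * n_cols))

# Assumed formula: percentage of rows outside any duplicate group
# (keep=False marks both copies of the repeated row, so 2 of 7 rows)
rows_in_duplicate_groups = int(df.duplicated(keep=False).sum())
duplicate_score = 100 * (1 - rows_in_duplicate_groups / n_rows)

print(f"missing    {missing_score:.2f}")
print(f"duplicates {duplicate_score:.2f}")
```

Both values round to the figures in the table (97.62 and 71.43), which suggests, but does not prove, that the tool counts missing values per cell and counts both copies of a duplicated row.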
