A comprehensive data analysis application built with Python, featuring both command-line and GUI interfaces. This project demonstrates professional-grade data analysis practices with real-world datasets.
This project includes:
- Data Cleaning: Handle missing values, duplicates, and data validation
- Exploratory Data Analysis (EDA): Statistical analysis and correlation studies
- Advanced Visualizations: Multiple chart types using Matplotlib and Seaborn
- Interactive GUI: User-friendly tkinter interface for analysis
- Portfolio Ready: Production-quality code suitable for real-world applications
Data-Analysis-Project/
├── scripts/
│ ├── data_analysis.py # Main analysis script
│ ├── app.py # tkinter GUI application
│ └── create_dataset.py # Dataset generation utility
├── data/ # Data and visualization outputs
│ ├── sample_data.csv # Sample dataset
│ └── *.png # Generated visualizations
├── requirements.txt # Python dependencies
└── README.md # This file
- Python 3.8 or higher
- pip (Python package manager)
- Clone or navigate to the project directory:
  cd Data-Analysis-Project
- Create a virtual environment (recommended):
  python -m venv venv
  # On Windows:
  venv\Scripts\activate
  # On macOS/Linux:
  source venv/bin/activate
- Install dependencies:
  pip install -r requirements.txt
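To confirm that the install step worked, a short check like the one below can help. It is a minimal sketch and not part of the project: the helper names `parse_requirement` and `report_installed` are hypothetical, and it only understands simple `name>=version` lines such as those in requirements.txt.

```python
from importlib import metadata  # standard library since Python 3.8


def parse_requirement(line):
    """Split a simple 'name>=version' line into (name, minimum_version)."""
    name, _, minimum = line.strip().partition(">=")
    return name.strip(), minimum.strip()


def report_installed(requirement_lines):
    """Map each required package to its installed version, or None if missing."""
    report = {}
    for line in requirement_lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        name, _minimum = parse_requirement(line)
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


# The same pins that requirements.txt lists
REQUIREMENTS = [
    "pandas>=1.3.0",
    "numpy>=1.21.0",
    "matplotlib>=3.4.0",
    "seaborn>=0.11.0",
    "scikit-learn>=0.24.0",
]

for package, version in report_installed(REQUIREMENTS).items():
    print(f"{package}: {version or 'NOT INSTALLED'}")
```

The sketch only reports what is installed; it does not compare versions against the minimums.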
Run the main analysis script to analyze the built-in Iris dataset:
python scripts/data_analysis.py
This will:
- Load the Iris dataset
- Perform data cleaning
- Generate comprehensive statistics
- Create 6 different visualization plots
- Display detailed analysis report
Output files created:
- data/1_distribution_plots.png: Histogram distributions
- data/2_box_plots.png: Outlier detection
- data/3_correlation_heatmap.png: Variable relationships
- data/4_scatter_plots.png: Scatter analysis
- data/5_categorical_analysis.png: Category distributions
- data/6_pair_plot.png: Complete pair matrix
Launch the tkinter GUI for interactive analysis:
python scripts/app.py
Features:
- Data Loading: Load Iris dataset or custom CSV files
- Data Info: View dataset structure and metadata
- Analysis Tools:
  - Data Cleaning Report
  - Statistical Summary
  - Correlation Analysis
- Visualizations:
  - Distribution Plots
  - Box Plots
  - Scatter Plots
  - Correlation Heatmap
  - Pair Plots
- Export: Save analysis results to CSV and text files
Create a realistic employee dataset:
python scripts/create_dataset.py
Then load data/sample_data.csv in the GUI application.
Data Cleaning
- Detect and handle missing values
- Remove duplicate records
- Validate data types
- Generate quality assessment reports
Statistical Analysis
- Descriptive statistics (mean, median, std dev, etc.)
- Skewness and kurtosis analysis
- Quartile analysis (Q1, Q3, IQR)
- Correlation coefficient calculation
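The statistical measures listed above map closely onto pandas methods. The following is a minimal sketch on made-up numbers, not the project's own code; `summarize` is a hypothetical helper name. Note that pandas reports bias-corrected skewness and excess kurtosis (0 for a normal distribution).

```python
import pandas as pd


def summarize(series):
    """Descriptive statistics for one numeric column."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    return {
        "mean": series.mean(),
        "median": series.median(),
        "std": series.std(),        # sample standard deviation (ddof=1)
        "skewness": series.skew(),
        "kurtosis": series.kurt(),  # excess kurtosis
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }


# Made-up illustrative data
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 11]})
stats = summarize(df["x"])
correlations = df.corr()  # Pearson correlation matrix by default

print(stats)
print(correlations)
```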
Distribution Analysis
- Histograms with KDE curves
- Shows data spread and frequency
Outlier Detection
- Box plots for each feature
- Identifies anomalous values
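Box plots conventionally flag points that lie more than 1.5 × IQR beyond the quartiles. The sketch below applies that rule numerically; it assumes this convention rather than reflecting the project's exact method, and `iqr_outliers` is a hypothetical helper name.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Made-up measurements with one obviously anomalous value
values = pd.Series([10, 12, 11, 13, 12, 11, 40])
print(iqr_outliers(values).tolist())
```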
Correlation Heatmap
- Visual correlation matrix
- Color-coded strength indicators
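A color-coded correlation matrix can be drawn in a few lines. This is a minimal sketch using plain Matplotlib `imshow`; the project's script may well use Seaborn instead, and the function name `plot_correlation_heatmap`, the demo data, and the output path are all illustrative assumptions.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_correlation_heatmap(df, path):
    """Save a color-coded Pearson correlation matrix of the numeric columns."""
    corr = df.select_dtypes(include="number").corr()
    fig, ax = plt.subplots(figsize=(5, 4))
    image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=45, ha="right")
    ax.set_yticks(range(len(corr.index)))
    ax.set_yticklabels(corr.index)
    # Annotate each cell with its coefficient
    for i in range(corr.shape[0]):
        for j in range(corr.shape[1]):
            ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")
    fig.colorbar(image, ax=ax, label="Pearson r")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return corr


# Made-up data; the non-numeric column is ignored
demo = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 5, 9], "label": list("wxyz")})
demo_path = os.path.join(tempfile.gettempdir(), "correlation_heatmap_demo.png")
demo_corr = plot_correlation_heatmap(demo, demo_path)
print(demo_corr)
```

Fixing the color scale to the range -1 to 1 keeps colors comparable between datasets.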
Scatter Plots
- Variable relationships
- Trend identification
Categorical Analysis
- Bar charts for categorical data
- Frequency distributions
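The frequencies behind such a bar chart can be computed with pandas `value_counts`. This is a small sketch on made-up labels, not the project's code.

```python
import pandas as pd

# Made-up category labels
species = pd.Series(
    ["setosa", "virginica", "setosa", "versicolor", "virginica", "setosa"],
    name="species",
)

counts = species.value_counts()                # absolute frequency, most common first
shares = species.value_counts(normalize=True)  # relative frequency, sums to 1

print(counts)
print(shares.round(2))
# counts.plot(kind="bar") would draw the bar chart (requires matplotlib)
```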
Pair Plot Matrix
- Complete relationship overview
- All variable combinations
Load Data → Clean Data → Exploratory Analysis → Visualize → Infer Results
Load Data
- Built-in Iris dataset or custom CSV
Clean Data
- Check for missing values
- Remove duplicates
- Validate data types
Exploratory Analysis
- Calculate descriptive statistics
- Analyze correlations
- Identify distributions
Visualize
- Create comprehensive charts
- Identify patterns
- Highlight relationships
Infer Results
- Interpret statistical results
- Identify key insights
- Generate recommendations
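The cleaning stage of this workflow (missing values, duplicates, data types) could look roughly like this with pandas. It is a minimal sketch, not the project's actual implementation: `clean_dataframe` is a hypothetical helper, and median imputation followed by dropping any still-incomplete rows is just one assumed strategy.

```python
import pandas as pd


def clean_dataframe(df):
    """Return a cleaned copy of df plus a small quality report."""
    report = {
        "rows_before": len(df),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }
    cleaned = df.drop_duplicates().copy()
    numeric_cols = cleaned.select_dtypes(include="number").columns
    # Assumption: fill numeric gaps with each column's median
    cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())
    # Drop rows that still lack a value in a non-numeric column
    cleaned = cleaned.dropna()
    report["rows_after"] = len(cleaned)
    report["dtypes"] = cleaned.dtypes.astype(str).to_dict()
    return cleaned, report


# Made-up records: one missing age and one exact duplicate row
raw = pd.DataFrame({
    "age": [25.0, None, 40.0, 40.0],
    "dept": ["HR", "IT", "IT", "IT"],
})
cleaned, report = clean_dataframe(raw)
print(report)
print(cleaned)
```

Returning the report alongside the data keeps the quality assessment separate from the transformation.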
The Iris dataset contains 150 samples with 4 numeric features and a species label:
- Sepal Length (cm)
- Sepal Width (cm)
- Petal Length (cm)
- Petal Width (cm)
- Species (Setosa, Versicolor, Virginica)
- Strong Correlations: Petal measurements show high correlation (r > 0.96)
- Species Differentiation: Clear clustering by species in visualization
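- Feature Importance: Petal length and width discriminate between species better than the sepal measurements
- Distribution: Sepal measurements are approximately normally distributed; petal measurements are bimodal when species are pooled, because Setosa separates from the other two species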
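The correlation finding can be checked directly. This sketch assumes the copy of Iris bundled with scikit-learn (already listed in requirements.txt); the project's script may load the data differently.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)  # as_frame requires scikit-learn >= 0.23
features = iris.data             # DataFrame with the four measurement columns

# Pearson correlation between the two petal measurements
r = features["petal length (cm)"].corr(features["petal width (cm)"])
print(f"petal length vs petal width: r = {r:.3f}")

# Per-species means show how the petal measurements separate the classes
species = iris.target.map(dict(enumerate(iris.target_names)))
species_means = features.groupby(species).mean()
print(species_means)
```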
See requirements.txt:
pandas>=1.3.0
numpy>=1.21.0
matplotlib>=3.4.0
seaborn>=0.11.0
scikit-learn>=0.24.0
pip install -r requirements.txt
This project demonstrates:
✅ Real-World Data Analysis: Complete pipeline from raw data to insights
✅ Professional Coding: Clean, documented, modular Python code
✅ Multiple Interfaces: Both CLI and GUI implementations
✅ Data Visualization: Advanced plotting with Matplotlib and Seaborn
✅ Statistical Analysis: Comprehensive statistical methods
✅ Best Practices: Error handling, documentation, code organization
✅ Pandas/NumPy Expertise: Advanced data manipulation
✅ User Interface Design: Professional tkinter GUI with organized layout
The GUI app supports loading any CSV file:
- Launch the app: python scripts/app.py
- Click "Load CSV File"
- Select your dataset
- Run analysis tools as needed
Edit scripts/data_analysis.py to:
- Add custom statistical tests
- Create domain-specific visualizations
- Implement specialized cleaning for your data
- Add predictive models
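# Assumption: DataAnalyzer is defined in scripts/data_analysis.py and scripts/ is on the import path
from data_analysis import DataAnalyzer

analyzer = DataAnalyzer(data_source='path/to/data.csv')
analyzer.data_cleaning()              # run the cleaning step
analyzer.exploratory_data_analysis()  # compute statistics and correlations
analyzer.create_visualizations()      # generate the plots
analyzer.generate_report()            # display the analysis report
- Multi-tab interface (Console, Visualization, Data Preview)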
- Real-time output logging
- Interactive chart generation
- Export functionality
- Error handling with user feedback
Issue: GUI doesn't launch
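- Solution: Ensure tkinter is installed. It ships with most Python installers, but some Linux distributions package it separately (for example, python3-tk on Debian/Ubuntu)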
Issue: Missing library error
- Solution: Run pip install -r requirements.txt
Issue: CSV file won't load
- Solution: Ensure CSV file is properly formatted and accessible
Issue: Plots not displaying
- Solution: Check that data contains numeric columns for visualization
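Two of these checks are easy to script. The following is a minimal sketch using only the standard library; `tkinter_available` and `numeric_columns` are hypothetical helper names, and the rule "every non-empty value parses as a float" is an assumed definition of a numeric column, not necessarily what the GUI applies.

```python
import csv
import os
import tempfile


def tkinter_available():
    """True if tkinter can be imported by this interpreter."""
    try:
        import tkinter  # noqa: F401
    except ImportError:
        return False
    return True


def numeric_columns(csv_path):
    """Names of columns whose every non-empty value parses as a float."""
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames or []
        rows = list(reader)
    numeric = []
    for column in fieldnames:
        values = [row[column] for row in rows if row[column] not in ("", None)]
        if not values:
            continue  # an empty column cannot be plotted
        try:
            for value in values:
                float(value)
        except ValueError:
            continue  # at least one value is not a number
        numeric.append(column)
    return numeric


# Demo on a throwaway CSV with made-up contents
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("name,salary\nAda,52000\nLin,61000.5\n")
found = numeric_columns(tmp.name)
os.remove(tmp.name)

print("tkinter available:", tkinter_available())
print("numeric columns:", found)
```

If `numeric_columns` returns an empty list for your file, the plotting tools have nothing to draw.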
- Start with Iris Dataset: Understand the analysis flow first
- Explore GUI Features: Try all visualization options
- Custom Data: Test with your own datasets
- Modify Code: Adapt analysis for your specific needs
- Save Results: Export analysis for presentations
After using this project, you'll understand:
- Data cleaning and preprocessing techniques
- Statistical analysis and correlation studies
- Data visualization best practices
- GUI development with tkinter
- Professional Python project structure
- Real-world data analysis workflows
Feel free to:
- Add new visualization types
- Implement additional statistical tests
- Optimize performance
- Improve user interface
- Add new analysis features
This project is provided as-is for educational and portfolio purposes.
For questions or improvements, feel free to extend the project with your own enhancements.
Happy Analyzing! 📊
Start your data science journey with this professional-grade analysis tool.