A high-performance, asynchronous web scraper for extracting real estate property data from mulk.az for analytics and data analysis purposes.
- Asynchronous scraping using `aiohttp` and `asyncio` for maximum performance
- Comprehensive data extraction including property details, prices, locations, and contact information
- Multiple output formats (CSV, JSON) for data analysis
- Built-in analytics with price statistics and market insights
- Rate limiting and retry logic to respect the website
- Configurable concurrency to balance speed and server load
- Pagination handling to scrape multiple pages automatically
- Data validation and cleaning for analytics-ready output
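
The rate limiting and concurrency cap listed above are commonly combined with an `asyncio.Semaphore`; a minimal sketch of that pattern (the `fetch` coroutine here is a dummy stand-in, not the scraper's real request method):

```python
import asyncio

async def fetch(url: str) -> str:
    # Stand-in for a real aiohttp request
    await asyncio.sleep(0.01)
    return f"response for {url}"

async def bounded_fetch(sem: asyncio.Semaphore, url: str, delay: float) -> str:
    # The semaphore caps in-flight requests; the delay spaces them out
    async with sem:
        result = await fetch(url)
        await asyncio.sleep(delay)
        return result

async def crawl(urls, max_concurrent=10, delay=0.5):
    sem = asyncio.Semaphore(max_concurrent)
    return await asyncio.gather(*(bounded_fetch(sem, u, delay) for u in urls))

results = asyncio.run(crawl([f"page{i}" for i in range(5)],
                            max_concurrent=2, delay=0.01))
print(len(results))
```

`asyncio.gather` preserves input order, so results line up with the URL list even though requests overlap.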
- Clone or download the scraper files
- Install dependencies:

```bash
pip install -r requirements.txt
```
```python
import asyncio
from async_mulk_scraper import AsyncMulkAzScraper

async def main():
    scraper = AsyncMulkAzScraper(max_concurrent=10, delay=0.5)

    # Scrape properties
    search_url = "https://www.mulk.az/search.php?category=&lease=false&bolge_id=1"
    properties = await scraper.scrape_all_properties(search_url, max_pages=3)

    # Save data
    await scraper.save_to_csv_async("properties.csv")
    await scraper.save_to_json_async("properties.json")

    # Get analytics
    analytics = scraper.get_analytics_summary()
    print(f"Average price: {analytics['price_stats']['avg']:,} AZN")

if __name__ == "__main__":
    asyncio.run(main())
```
```bash
# Run interactive examples
python example_usage.py

# Run the main async scraper
python async_mulk_scraper.py

# Run analytics version (slower but more features)
python analytics_scraper.py
```
| File | Description | Best For |
|---|---|---|
| `async_mulk_scraper.py` | High-performance async scraper | Large-scale data collection |
| `mulk_scraper.py` | Traditional sync scraper | Small-scale scraping, learning |
| `analytics_scraper.py` | Enhanced version with visualizations | Data analysis and reporting |
| `example_usage.py` | Usage examples and tests | Learning and testing |
Each property record includes:

- `listing_id` - Unique property ID
- `title` - Property title
- `url` - Detail page URL
- `listing_date` - When the property was listed
- `scraped_at` - When the data was scraped
- `price` / `price_numeric` - Property price in AZN
- `category` - Property type (apartment, house, etc.)
- `rooms` / `rooms_numeric` - Number of rooms
- `area` / `area_numeric` - Property area in m²
- `floor` / `current_floor` / `total_floors` - Floor information
- `deed_available` - Legal documentation status
- `location_district` - District/region
- `location_neighborhood` - Neighborhood
- `location_metro` - Nearest metro station
- `full_address` - Complete address
- `contact_person` - Contact person name
- `contact_type` - Type of contact (agent, owner, etc.)
- `contact_phone` - Phone number
- `description` - Property description
- `images` - Array of image URLs
- `image_count` - Number of images
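
For type-safe downstream analysis, the record fields above map naturally onto a dataclass; a sketch covering a subset of the fields, with assumed types (the scraper's actual record class may differ):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PropertyRecord:
    # Core identifiers
    listing_id: str
    title: Optional[str] = None
    url: Optional[str] = None
    # Numeric fields used for analytics
    price_numeric: Optional[int] = None
    rooms_numeric: Optional[int] = None
    area_numeric: Optional[float] = None
    # Location and contact
    location_district: Optional[str] = None
    contact_phone: Optional[str] = None

rec = PropertyRecord(listing_id="348729", price_numeric=205000,
                     location_district="Sabunçu")
print(rec.price_numeric)
```

Making everything but `listing_id` optional mirrors the data-quality advice below: listings often arrive with missing fields.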
The scraper works with any mulk.az search URL. Here are some examples:

```python
# All properties for sale
"https://www.mulk.az/search.php?category=&lease=false&bolge_id=1"

# Apartments only
"https://www.mulk.az/search.php?category=apartment&lease=false&bolge_id=1"

# Price range 100k-300k AZN
"https://www.mulk.az/search.php?pricemin=100000&pricemax=300000&lease=false&bolge_id=1"

# 3-room properties
"https://www.mulk.az/search.php?rooms=3&lease=false&bolge_id=1"

# Specific district (Nasimi)
"https://www.mulk.az/search.php?rayon_id=2&lease=false&bolge_id=1"
```
```python
scraper = AsyncMulkAzScraper(
    max_concurrent=10,  # Number of concurrent requests
    delay=0.5           # Delay between requests (seconds)
)
```
| Setting | Conservative | Balanced | Aggressive |
|---|---|---|---|
| `max_concurrent` | 5 | 10 | 15-20 |
| `delay` | 1.0s | 0.5s | 0.2-0.3s |
| Best for | Slow/unstable connection | General use | Fast connection, bulk scraping |
Perfect for Excel and other data analysis tools:

```csv
listing_id,title,price_numeric,location_district,contact_phone,...
348729,Satış » Köhnə tikili,205000,Sabunçu,(070) 845-73-70,...
```
Structured data for programming:

```json
[
  {
    "listing_id": "348729",
    "title": "Satış » Köhnə tikili",
    "price_numeric": 205000,
    "location_district": "Sabunçu",
    "contact_phone": "(070) 845-73-70",
    "images": ["https://mulk.az/images/555231.jpg", ...]
  }
]
```
```python
analytics = scraper.get_analytics_summary()
print(analytics)
```

Output:

```json
{
  "total_properties": 150,
  "valid_properties": 142,
  "price_stats": {
    "min": 23000,
    "max": 550000,
    "avg": 185000,
    "median": 165000
  },
  "top_districts": {
    "Sabunçu": 45,
    "Abşeron": 32,
    "Nəsimi": 28
  }
}
```
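
For reference, `price_stats` values like these can be reproduced from a list of numeric prices with the stdlib `statistics` module (a sketch with sample data, not the scraper's internal implementation):

```python
import statistics

# Hypothetical sample of price_numeric values
prices = [23000, 165000, 185000, 205000, 550000]

stats = {
    "min": min(prices),
    "max": max(prices),
    "avg": round(statistics.mean(prices)),
    "median": statistics.median(prices),
}
print(stats)
```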
- Price distribution charts
- Properties by district visualization
- Price per square meter analysis
- Market trend analysis
- Comprehensive reporting
Typical performance on a modern machine:

| Scenario | Properties | Time | Rate |
|---|---|---|---|
| Small test (1 page) | ~25 props | 15s | 1.7/sec |
| Medium scrape (3 pages) | ~75 props | 45s | 1.7/sec |
| Large scrape (10 pages) | ~250 props | 150s | 1.7/sec |

Performance depends on network speed, server response time, and concurrency settings.
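
Given the roughly constant ~1.7 properties/sec throughput in the table, a back-of-the-envelope runtime estimate is just pages × properties-per-page ÷ rate (the ~25 properties per page is an assumption taken from the table):

```python
def estimated_seconds(pages: int, props_per_page: int = 25, rate: float = 1.7) -> float:
    # Rough estimate from the observed ~1.7 properties/sec throughput
    return pages * props_per_page / rate

print(round(estimated_seconds(10)))  # ~147s for a 10-page scrape
```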
- Use reasonable delays (0.5s minimum)
- Limit concurrent requests (10 max recommended)
- Don't scrape during peak hours
- Cache results to avoid repeated requests
- Always validate extracted data
- Handle missing fields gracefully
- Clean and normalize data for analysis
- Remove duplicates
- The scraper includes built-in retry logic
- Check logs for failed requests
- Monitor success rates
- Implement fallback strategies
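
Retry logic with exponential backoff, as recommended above, typically looks like this; a minimal sketch, not the scraper's built-in implementation (the `flaky` coroutine only simulates transient failures):

```python
import asyncio

async def fetch_with_retry(fetch, url, retries=3, backoff=0.5):
    # Retry with exponential backoff; re-raise after the final attempt
    for attempt in range(retries):
        try:
            return await fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            await asyncio.sleep(backoff * 2 ** attempt)

calls = {"n": 0}

async def flaky(url):
    # Fails twice, then succeeds - simulates a transient network error
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = asyncio.run(fetch_with_retry(flaky, "page1", backoff=0.01))
print(result)
```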
```python
# Compare prices across districts
properties = await scraper.scrape_all_properties(search_url, max_pages=10)

district_prices = {}
for prop in properties:
    if prop.location_district and prop.price_numeric:
        district_prices.setdefault(prop.location_district, []).append(prop.price_numeric)

for district, prices in district_prices.items():
    avg_price = sum(prices) / len(prices)
    print(f"{district}: {avg_price:,.0f} AZN average")

# Find undervalued properties (price per sqm)
undervalued = []
for prop in properties:
    if prop.area_numeric and prop.price_numeric:
        price_per_sqm = prop.price_numeric / prop.area_numeric
        if price_per_sqm < 2000:  # Below 2000 AZN/sqm
            undervalued.append(prop)

print(f"Found {len(undervalued)} potentially undervalued properties")

# Analyze agent vs. owner listings (guard against missing contact_type)
agent_count = len([p for p in properties if p.contact_type and 'agent' in p.contact_type.lower()])
owner_count = len([p for p in properties if p.contact_type and 'owner' in p.contact_type.lower()])
print(f"Agents: {agent_count}, Owners: {owner_count}")
```
"No properties found"
- Check if the search URL is valid
- Try reducing max_pages to test
- Verify internet connection
"Too many request errors"
- Increase delay between requests
- Reduce max_concurrent setting
- Check if IP is being rate limited
"Invalid data in output"
- Some properties may have incomplete data
- Filter out invalid entries before analysis
- Check the website structure hasn't changed
"Slow performance"
- Increase max_concurrent (up to 15)
- Reduce delay (down to 0.3s)
- Use async version instead of sync
Enable detailed logging:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```
- This scraper is for educational and research purposes
- Respect the website's terms of service
- Don't overload the server with excessive requests
- Use scraped data responsibly and ethically
- Consider contacting the website for official API access
- Respect robots.txt guidelines
Feel free to improve the scraper:
- Add new data fields
- Improve error handling
- Optimize performance
- Add new analytics features
- Fix bugs and edge cases
This project is provided as-is for educational purposes. Use responsibly and in accordance with applicable laws and terms of service.