The aim of this project was to delve deep into the realm of cinema and derive meaningful insights through the lens of data journalism. My mission was to analyze a dataset consisting of films, critics, and directors, originally sourced from a BBC article. The goal was to identify trends and narratives in film critique and production spanning the last 15 years, offering a unique perspective on the global film industry.
The project commenced with the complex task of data collection. Leveraging Python libraries such as BeautifulSoup and requests, I meticulously scraped a comprehensive list of movies, critics, and their affiliations from the BBC's webpage. This phase required acute attention to detail, especially due to the minimalistic layout of the source page, which presented its own set of challenges in data isolation.
Following data acquisition, the next phase involved structuring the raw HTML data into a format amenable for analysis. I transformed the data into Python lists and dictionaries and subsequently into a structured pandas DataFrame. This process was critical in preparing the data for the subsequent analysis phase.
I extensively utilized regular expressions to parse and extract essential information from the dataset, including critic names, affiliations, countries, movie titles, directors, and release years.
Parallel to the initial data scraping, I expanded the data collection to Wikipedia. This involved scraping director biographical details, thereby enriching the dataset with contextual information about their birthplaces and dates.
In an effort to bring the data to life visually, I integrated the Mapbox API for advanced geospatial data representations, allowing me to create interactive and engaging maps that vividly illustrate the global distribution and impact of film directors and critics.
- 🇺🇸 The US, as expected, had the most number of critically acclaimed directors as well as the highest number of votes from critics
- 🇨🇳 Movies directed by Chinese directors had the second-highest number of votes - 98 in total
- 🇮🇷 A surprising finding was that Iranian movies were very popular with critics. Six directors were critically acclaimed, with Asghar Farhadi leading the pack with 31 votes
- Data Scraping and Cleaning: This project significantly enhanced my skills in web scraping and the intricacies of data cleaning and preparation.
- Regular Expressions Mastery: Through extensive use, I developed a proficiency in using regex for effective data extraction.
- Pandas Proficiency: The process of data transformation using pandas was invaluable, improving my ability to manipulate and analyze large datasets.
- Creative Problem-Solving: Each challenge encountered necessitated unique and creative solutions, underscoring the importance of adaptability and innovation in data journalism.
In retrospect, several areas could be refined in future projects:
- Automation and Efficiency: Automating certain aspects of the scraping process would enhance efficiency.
- Advanced Data Analysis: Implementing more sophisticated statistical methods could yield deeper insights.
- Enhanced Visual Storytelling: Incorporating advanced data visualization techniques would further augment the storytelling aspect of the data.
This project represents a comprehensive endeavor in data journalism, blending technical data processing with narrative storytelling. It stands as a testament to the power of data in revealing hidden stories within the film industry.