201819A COM5507 Social Media Data Acquisition and Processing

  • This repository was created in 2018 Fall. It stores the course documents of a postgraduate-level course, COM5507 Social Media Data Acquisition and Processing, for the Master of Arts in Communication and New Media program (MACNM) @ City University of Hong Kong (CityU), guest lectured by Dr. Xinzhi Zhang from Hong Kong Baptist University.

  • #Data_science_101 #Python #automated #web_data_collection #opendata #web_scraping #API #pandas #numpy #tm #sna #dataviz #macnm #cityucom_10thanniversary

Course Instructor (Guest)

Dr. Xinzhi Zhang (Hong Kong Baptist University)

This course introduces the fundamental knowledge and hands-on skills of big data analytics in the field of media and communication, with a special focus on techniques for searching, collecting, analyzing, interpreting, and visualizing data. Technical topics include, but are not limited to, web crawling, data storage, data analysis, text mining, social network analysis, and data visualization, all based on open source software packages. Teaching and learning activities include class demonstrations, individual exercises, quizzes, collaborative projects, and guest lectures. By the end of the semester, students are expected to be able to collect big data from different sources (e.g., social media harvesting, web scraping, and retrieval of online archive or index data) with open source software packages. Students are also expected to produce socially, culturally, or commercially meaningful data-driven narrative outputs, such as data-driven journalistic reports, data visualizations, data-driven business analyses, and computational social science research reports. Critical reflection on the overuse and abuse of big data, and on the related ethical and legal controversies, runs throughout the semester.

Course Structure

This course contains a total of 13 classes (weeks). Each class lasts for 3 hours. There are 11 lectures (including in-class assignments and tutorials), 1 project consultation week, and 1 presentation week.

The lectures are divided into four units, supplemented by several additional workshops and a presentation week:

  • Unit 1: Data science fundamentals and basic Python programming (week 1 – 4)
  • Unit 2: Automated web data collection (week 5 – 8)
  • Unit 3: Data processing and data management (week 9 – 11)
  • Unit 4: Data exploration (week 11 – 12)
  • Project implementation & presentation

Course Syllabus (weekly teaching plan)

| Week | Content | Tools, packages, & techs | Documents |
| --- | --- | --- | --- |
| Week 1 | Introduction: Media and communication in the digital age | Tools installation (Python; Anaconda, Jupyter Notebook; Git and GitHub; Markdown language) | Slides 1. Intro; Slides 2. Data science in Action; Slides 3. Tools installation |
| Week 2 | Python in action: A command-liner's perspective | Python (program execution, variables, expressions, data structure, function); command line interface | Slides; Code examples |
| Week 3 | Python in action: in an interactive notebook | Python (control flow statements, errors and debugging); Jupyter Notebook, Numpy, Pandas | Slides; Code examples |
| Week 4 | Data science pipeline & project implementation (1) | Data scientists' workflow, data-driven investigation | Slides |
| Week 5 | Web scraping ep. 1 | Web technologies (HTTP, HTML, CSS), Requests, BeautifulSoup | Slides; Once upon a time at CityU; Once upon a decade at CityU |
| Week 6 | Web scraping ep. 2 | Data sources, web scraping pipelines, Requests, BeautifulSoup, Pandas | Slides |
| Week 7 | Mining the social web | Web data formats (JSON, XML), API, Cloud computing | Slides; Codes for XML; Codes for JSON |
| Week 8 | More topics in web data collection: structures and automation | Exploring Selenium | Slides; "Bags are cure-all" |
| Week 9 | Data processing | Pandas | Slides |
| Week 10 | Data exploration ep. 1: Numerical data processing | Pandas | Slides |
| Week 11 | Data exploration ep. 2: Text data processing | Regex, Pandas, Matplotlib | Slides |
| Week 12 | Data exploration ep. 3: Networks, maps, and project finalizing | Pandas, Matplotlib | Slides |
| Week 13 | Group project presentation | Integrated data-driven storytelling | Guidelines for the final project |
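
Week 7 of the syllabus covers the two web data formats that APIs most often return, JSON and XML. As a rough illustration (not taken from the course notebooks), the sketch below parses the same record from both formats with Python's standard library. The `post` structure, its field names, and the sample values are all hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical API payloads describing the same post (illustrative data only).
JSON_TEXT = '{"post": {"id": 1, "author": "alice", "tags": ["news", "data"]}}'
XML_TEXT = (
    "<post id='1'><author>alice</author>"
    "<tags><tag>news</tag><tag>data</tag></tags></post>"
)


def parse_json_post(text):
    """Return a flat dict built from the JSON representation."""
    post = json.loads(text)["post"]
    return {"id": post["id"], "author": post["author"], "tags": post["tags"]}


def parse_xml_post(text):
    """Return the same flat dict built from the XML representation."""
    root = ET.fromstring(text)
    return {
        "id": int(root.get("id")),  # XML attribute values are always strings
        "author": root.findtext("author"),
        "tags": [tag.text for tag in root.iter("tag")],
    }


if __name__ == "__main__":
    print(parse_json_post(JSON_TEXT))
    print(parse_xml_post(XML_TEXT))
```

JSON maps directly onto Python dicts and lists, while XML needs explicit choices about attributes versus child elements and about type conversion, which is why the XML function is slightly longer.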

Notes: the code examples are for educational purposes and in-class demonstrations only. Because the web pages being harvested are subject to change, the code presented in this course may not always work.
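
Because target pages change, scraping code tends to break silently when a tag or class name is renamed. One common precaution is to check that the expected elements were actually found and to fail with a clear message otherwise. The sketch below shows the idea with the standard library's `html.parser`, used here instead of the BeautifulSoup taught in class only so that it runs without extra installs. The `headline` class name and both sample pages are hypothetical.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of every element that carries a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.open_tag = None   # tag name of the matched element we are inside
        self.nesting = 0       # how many tags of that name are currently open
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if self.open_tag is None:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.open_tag = tag
                self.nesting = 1
                self.texts.append("")
        elif tag == self.open_tag:
            self.nesting += 1

    def handle_endtag(self, tag):
        if self.open_tag is not None and tag == self.open_tag:
            self.nesting -= 1
            if self.nesting == 0:
                self.open_tag = None

    def handle_data(self, data):
        if self.open_tag is not None:
            self.texts[-1] += data


def extract_headlines(html, css_class="headline"):
    """Return headline texts, or raise if the expected elements are missing."""
    parser = ClassTextCollector(css_class)
    parser.feed(html)
    parser.close()
    headlines = [text.strip() for text in parser.texts if text.strip()]
    if not headlines:
        raise ValueError(
            f"no elements with class {css_class!r}; the page layout may have changed"
        )
    return headlines


# Hypothetical page before and after a redesign that renamed the class.
OLD_PAGE = (
    '<h2 class="headline">First story</h2>'
    '<h2 class="headline">Second <em>story</em></h2>'
)
NEW_PAGE = '<h2 class="title">First story</h2>'

if __name__ == "__main__":
    print(extract_headlines(OLD_PAGE))
    try:
        extract_headlines(NEW_PAGE)
    except ValueError as error:
        print("scraper needs updating:", error)
```

Raising an error rather than returning an empty list makes a layout change visible at collection time, instead of surfacing later as an unexplained gap in the dataset.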

Representative Students' Works

| Type | The task | Documentation |
| --- | --- | --- |
| Individual assignment 01 | Screen scraping: scraping information from a single webpage or multiple webpages, and storing the information in a machine-readable "spreadsheet" format | Link |
| Individual assignment 02 | Data processing and data exploration for text and numerical data | Link |
| Group exercise 01 | "A thought experiment": converting a text-based story into a data-driven news report ("datafication") | Link |
| Group exercise 02 | Another "thought experiment": converting a data-driven news report into a text one ("de-datafication") | Link |
| Final project | Integrated data-driven storytelling projects | Link |
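
The task in Individual assignment 01, turning information on a web page into a machine-readable spreadsheet, can be sketched in a few lines. This is a minimal standard-library illustration, not the students' or the instructor's code: it assumes the data already sits in a simple HTML table, and the sample HTML is made up. In practice the page would first be downloaded (for example with Requests) and the CSV text written to a file.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the cell text of every <tr> row in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")      # start a new cell in the current row
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


def html_table_to_csv(html):
    """Convert the rows of a simple HTML table into CSV text."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in parser.rows:
        writer.writerow([cell.strip() for cell in row])
    return buffer.getvalue()


# Made-up sample page; a real page would be fetched over HTTP first.
SAMPLE_HTML = """
<table>
  <tr><th>Course</th><th>Term</th></tr>
  <tr><td>COM5507</td><td>Fall 2018</td></tr>
</table>
"""

if __name__ == "__main__":
    print(html_table_to_csv(SAMPLE_HTML), end="")
```

Going through the `csv` module rather than joining strings with commas keeps the output valid when a cell itself contains a comma or a quotation mark.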

About the Instructor

  • Xinzhi Zhang (M.A. & Ph.D., City University of Hong Kong, 2013) is a Research Assistant Professor at the Department of Journalism at Hong Kong Baptist University (the official page). His research interests include digital media and social change, comparative political communication, digital media and public health, and the social implications of big data technologies and AI algorithms. He is also an observer of computational social science and digital humanities. His research has appeared in peer-reviewed journals such as International Political Science Review, Computers in Human Behavior, International Journal of Communication, and Digital Journalism. He currently serves as the Programme Director of the Data and Media Communication concentration, an interdisciplinary concentration on data science and data-driven investigative reporting and storytelling, jointly offered by the Department of Computer Science and the Department of Journalism at HKBU.


This repository documents the course materials of my course COM5507 @ CityU in 2018 Fall.



