201819A COM5507 Social Media Data Acquisition and Processing
This repository was created in Fall 2018. It stores the course documents for a postgraduate-level course, COM5507 Social Media Data Acquisition and Processing, in the Master of Arts in Communication and New Media program (MACNM) at City University of Hong Kong (CityU). The course was guest-lectured by Dr. Xinzhi Zhang from Hong Kong Baptist University.
#Data_science_101 #Python #automated #web_data_collection #opendata #web_scraping #API #pandas #numpy #tm #sna #dataviz #macnm #cityucom_10thanniversary
Course Instructor (Guest)
- Dr. Xinzhi Zhang, Research Assistant Professor, Department of Journalism, Hong Kong Baptist University.
- Office hour: by appointment
This course introduces the fundamental knowledge and hands-on skills of big data analytics in the field of media and communication. Special focus is placed on techniques for searching, collecting, analyzing, interpreting, and visualizing data. Technical topics include, but are not limited to, web crawling, data storage, data analysis, text mining, social network analysis, and data visualization, all based on open source software packages. Through a variety of teaching and learning activities, such as class demonstrations, individual exercises, quizzes, collaborative projects, and guest lectures, students are expected by the end of the semester to be able to collect big data from different sources (social media harvesting, web scraping, and retrieval of online archive or index data) with open source software packages. Students are also expected to produce socially, culturally, or commercially meaningful data-driven narrative outputs, such as data-driven journalistic reports, data visualizations, data-driven business analyses, and computational social science research reports. The overuse and abuse of big data, and the related ethical and legal controversies, are also critically discussed throughout the semester.
This course contains a total of 13 classes (weeks). Each class lasts for 3 hours. There are 11 lectures (including in-class assignments and tutorials), 1 project consultation week, and 1 presentation week.
The lectures are divided into four units, complemented by several additional workshops and a presentation week:
- Unit 1: Data science fundamentals and basic Python programming (week 1 – 4)
- Unit 2: Automated web data collection (week 5 – 8)
- Unit 3: Data processing and data management (week 9 – 11)
- Unit 4: Data exploration (week 11 – 12)
- Project implementation & presentation
Course Syllabus (weekly teaching plan)
|Week||Content||Tools, packages, & techs||Documents|
|Week 1||Introduction: Media and communication in the digital age||Tools installation (Python; Anaconda, Jupyter Notebook; Git and GitHub; Markdown language)|
|Week 2||Python in action: A command-liner's perspective||Python (program execution, variables, expressions, data structure, function); command line interface|
|Week 3||Python in action: in an interactive notebook||Python (control flow statements, errors and debugging); Jupyter Notebook, Numpy, Pandas|
|Week 4||Data science pipeline & project implementation (1)||Data scientists' workflow, data-driven investigation|
|Week 5||Web scraping ep. 1||Web technologies (HTTP, HTML, CSS), Requests, BeautifulSoup|
|Week 6||Web scraping ep. 2||Data sources, web scraping pipelines, Requests, BeautifulSoup, Pandas|
|Week 7||Mining the social web||Web data formats (JSON, XML), API, Cloud computing|
|Week 8||More topics in web data collection: structures and automation||Exploring Selenium|
|Week 9||Data processing||Pandas|
|Week 10||Data exploration ep. 1: Numerical data processing||Pandas|
|Week 11||Data exploration ep. 2: Text data processing||Regex, Pandas, Matplotlib||Slides|
|Week 12||Data exploration ep. 3: Networks, maps, and project finalizing||Pandas, Matplotlib||Slides|
|Week 13||Group project presentation||Integrated data-driven storytelling||Guidelines for the final project|
Note: The code examples are for educational purposes and in-class demonstrations only. Because the webpages being harvested are subject to change, the code presented in this course may not always work.
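As a rough illustration of the Unit 2 idea (extract elements from a page, store them in a spreadsheet-like format, and fail loudly when the page layout changes), here is a minimal sketch. It is not taken from the course materials. To keep it runnable without a network connection or extra installs, it swaps in Python's standard-library `html.parser` and `csv` modules for the Requests, BeautifulSoup, and Pandas stack listed in the syllabus. The sample HTML, the `headline` class name, and the function names are all hypothetical.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content; a real pipeline would download this instead.
SAMPLE_HTML = """
<html><body>
  <h2 class="headline">Open data portal launched</h2>
  <h2 class="headline">Students publish data stories</h2>
  <h2 class="other">Not a headline</h2>
</body></html>
"""


class HeadlineParser(HTMLParser):
    """Collect the text of every <h2 class="headline"> element."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._inside = False  # True while we are inside a matching <h2>

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and dict(attrs).get("class") == "headline":
            self._inside = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.headlines[-1] += data


def scrape_headlines(html):
    """Return the headline texts, or raise if the expected markup is missing."""
    parser = HeadlineParser()
    parser.feed(html)
    headlines = [h.strip() for h in parser.headlines]
    if not headlines:
        # An empty result usually means the page structure has changed.
        raise ValueError("no headlines found; the page layout may have changed")
    return headlines


def to_csv(headlines):
    """Serialize the headlines as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["rank", "headline"])
    for rank, text in enumerate(headlines, start=1):
        writer.writerow([rank, text])
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(scrape_headlines(SAMPLE_HTML)))
```

Raising an error on an empty result, rather than silently writing an empty file, is one simple way to notice that a target page has changed since the code was written.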
Representative Students' Works
|Individual assignment 01||Screen scraping: scraping information from a single webpage or multiple webpages, and storing the information into a machine-readable "spreadsheet" format||Link|
|Individual assignment 02||Data processing and data exploration for text and numerical data||Link|
|Group exercise 01||"A thought experiment": converting a text-based story into a data-driven news report ("datafication")||Link|
|Group exercise 02||Another "thought experiment": converting a data-driven news report into a text-based story ("de-datafication")||Link|
|Final project||Integrated data-driven storytelling projects||Link|
About the Instructor
- Xinzhi Zhang (M.A. & Ph.D., City University of Hong Kong, 2013) is a Research Assistant Professor in the Department of Journalism at Hong Kong Baptist University (the official page). His research interests include digital media and social change, comparative political communication, digital media and public health, and the social implications of big data technologies and AI algorithms. He is also an observer of computational social science and digital humanities. His research has appeared in peer-reviewed journals such as International Political Science Review, Computers in Human Behavior, International Journal of Communication, and Digital Journalism. He currently serves as the Programme Director of the Data and Media Communication concentration, an interdisciplinary concentration on data science and data-driven investigative reporting and storytelling, jointly offered by the Department of Computer Science and the Department of Journalism at HKBU.