# Introduction to Data Science for Librarians

## Chapter 3 - Data and Data Management

### Chapter 3 Contents

3.1 Types of data

3.2 Collecting data from human subjects

3.3 Sources of existing data

3.4 Data management

3.5 Data ethics

3.6 Conclusion



### 3.1 Types of data

In Chapter 2, you learned about data and the types of data used in Python programming. In this chapter we approach data and data types from a slightly different, broader perspective. If you think of Python and other programming languages as "tools" for data science, then the approach to data taken in this chapter could be said to be tool agnostic; that is, we will examine some universal meanings for and properties of data that do not change depending on the tool being used.

The approach to data taken in this chapter is also a philosophical approach. The researcher's conception of data and how to use it to answer research questions influences their choice of data and methods of data analysis.

If you've had a research methods course or read a journal article that reports the results of a research project, you may have encountered the terms qualitative and/or quantitative research. Those terms refer to both the type of data used in a research project and the philosophical approach to the research, which, in turn, influences the choice of data type. In this section we'll cover the properties (or characteristics) of different kinds of data.

Qualitative research data are nonnumeric. They may consist of alphabetic characters, sound, images, or other types of media. Qualitative research consists of "analysing the subjective meaning or the social production of issues, events, or practices by collecting non-standardized data and analysing texts and images" (Flick, 2022).

Philosophically, qualitative researchers bring to their work the belief that in order to understand a particular phenomenon, it must be studied in its natural context. They believe that context must be taken into account in any conclusions that are drawn as a result of research.

Quantitative research data are numeric. They are usually measurements of "events or practices by collecting standardized data and using numbers and statistics for analysing them" (Flick, 2022). They may measure characteristics of a research subject that allow for comparison, ranking, and calculating mathematical relationships.

Philosophically, many quantitative researchers believe that a phenomenon may be understood through observation and that "the observation of data should be separated from the interpretation of their meanings. Truth is to be found by following general rules of method, largely independent of the content and context of the investigation" (Flick, 2022).

Although some believe that qualitative research and quantitative research are contradictory, many others believe that they are compatible and can complement one another to produce good results. Data science encompasses both types of research and data, often in very large quantities. For instance, techniques such as sentiment analysis, which use vast corpora of texts to analyze arguably subjective content, may be combined with statistical techniques applied to quantitative data. In the exercises in this book we will work mainly with numerical, quantitative data.
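To make the distinction concrete in Python terms, here is a minimal sketch. The survey records and field names (`visits_per_month`, `comment`) are invented for illustration and do not come from any real data set. It shows that numeric values support ranking and arithmetic directly, while text has to be interpreted or converted before it can be counted.

```python
from statistics import mean

# Hypothetical survey responses from library patrons (invented for illustration).
responses = [
    {"visits_per_month": 4, "comment": "The staff are friendly and helpful."},
    {"visits_per_month": 1, "comment": "I wish the library opened earlier."},
    {"visits_per_month": 7, "comment": "Great study rooms."},
]

# Quantitative data: numeric values allow comparison, ranking, and calculation.
visits = [r["visits_per_month"] for r in responses]
average_visits = mean(visits)
ranked_visits = sorted(visits, reverse=True)

# Qualitative data: text cannot be averaged. It must be read and interpreted,
# or converted into something countable (here, a simple word count per comment).
comments = [r["comment"] for r in responses]
word_counts = [len(c.split()) for c in comments]

print("Average visits per month:", average_visits)
print("Visits ranked high to low:", ranked_visits)
print("Words per comment:", word_counts)
```

Counting words is only one crude way to turn text into numbers; it says nothing about what the comments mean, which is the concern of qualitative analysis.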

### 3.2 Collecting data from human subjects

All data collection is subject to rules and best practices, but data collected from human subjects is arguably subject to the greatest number of rules, ethical principles, and laws. In the U.S., the Office for Human Research Protections (OHRP), part of the Department of Health and Human Services (HHS), is responsible for overseeing the U.S. law related to human subjects research found in the [Code of Federal Regulations at 45 C.F.R. Part 46](https://www.hhs.gov/ohrp/regulations-and-policy/regulations/45-cfr-46/index.html). These laws are intended to protect humans who participate in research from harm that might arise as a result of that participation.

On the local level, that is, at universities as well as hospitals and other research facilities, the law requires that human subjects research be approved by an Institutional Review Board (IRB). The role of the IRB, which is also sometimes called an Ethics Review Committee, is to "review research studies to ensure that they comply with applicable regulations, meet commonly accepted ethical standards, follow institutional policies, and adequately protect research participants" (HHS, 2021). Before beginning a research project that involves humans as its subject, researchers must submit a proposal for human subjects research to their local IRB and receive approval. Failing to obtain IRB approval for human subjects research can result in severe penalties for both the researcher and the institution.

Human subjects research training is available from a variety of sources, including free training from HHS, institutional training provided by an IRB itself, and fee-based training such as that provided by the Collaborative Institutional Training Initiative ([CITI Program](https://about.citiprogram.org/)). Most college and university IRBs require that researchers complete some kind of training before submitting an application to do human subjects research.

Some research data pertaining to humans does not require IRB approval. This includes existing data sets such as U.S. Census data, which is freely available and which we will use in some of the exercises in this book. Sources of existing data, from human subjects and other kinds of data, are the subject of the next section.


### 3.3 Sources of existing data

Over the past few decades, as computer storage capacity has increased and its cost has decreased, it has become more and more common for researchers to archive their raw data and, in some cases, make it available to other researchers. In fact, grant funding agencies may require that research data from a funded research project be archived and shared. In other cases, data that are collected by a government agency, such as the weather data used in Chapter 2, are made freely available. College and university libraries are often responsible for maintaining archives of raw data sets created by the institutional scholars and researchers they support as part of their institutional repositories. In all cases of existing data sets, special care is taken by researchers and by those responsible for maintaining the repositories and archives storing the data to remove any trace of information that might personally identify individual human subjects of research.
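One small part of that care can be sketched in Python. The example below is illustrative only: the field names and the record are hypothetical, and removing direct identifiers is just a first step, since combinations of remaining fields can sometimes still identify a person.

```python
# Hypothetical field names treated as direct identifiers in this sketch.
DIRECT_IDENTIFIERS = {"name", "email", "library_card_number"}


def drop_direct_identifiers(record, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of the record without the listed identifier fields."""
    return {key: value for key, value in record.items() if key not in identifiers}


# An invented record as a researcher might have collected it.
raw_record = {
    "name": "Pat Example",
    "email": "pat@example.org",
    "library_card_number": "000123",
    "age_group": "25-34",
    "visits_per_month": 4,
}

# The version prepared for sharing keeps only the non-identifying fields.
shared_record = drop_direct_identifiers(raw_record)
print(shared_record)
```

The function returns a new dictionary and leaves the original untouched, so the researcher's working copy and the shareable copy stay separate.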

In addition to being responsible for archiving data, libraries and librarians may also have a role in assisting researchers to find existing data sets. In this section, we'll take a look at a few sources of existing data available at the time this book was written.

#### *National Weather Service*
The [U.S. National Weather Service](https://www.weather.gov/) provides historical climate data such as that which was used in Chapter 2. Historical climate data is available for many U.S. cities.

#### *U.S. Census Data: American Community Survey*
"The American Community Survey (ACS) is an annual demographics survey program conducted by the United States Census Bureau. It regularly gathers information previously contained only in the long form of the decennial census, including ancestry, US citizenship status, educational attainment, income, language proficiency, migration, disability, employment, and housing characteristics" ("American Community Survey," 2025). Libraries may use ACS data to understand the populations they serve. ACS data are used in this course.

#### *Inter-university Consortium for Political and Social Research (ICPSR)*

"The Inter-university Consortium for Political and Social Research is an American political science and social science research consortium based at the University of Michigan" ("ICPSR," 2025). It is a repository for digital data sets as well as a teaching institution and source of raw data for research. It offers an intensive summer program in data analysis.

#### *DMP Tool*
"The DMP Tool is a free, open-source, application that helps researchers create data management plans (DMPs). These plans are required by many funding agencies as part of the grant proposal submission process. The DMP Tool provides a click-through wizard for creating a DMP that complies with funder requirements. It also has direct links to funder websites, help text for answering questions, and data management best practices resources" (DMP Tool, 2024). It also includes a wide variety of public data plans.
___


To use an existing data set successfully, it is very important to understand its original purpose, the definitions of all its variables, the data collection methods, and how the data are organized. Most downloadable data sets are located on web pages that also contain this kind of information.
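A simple habit that supports this is checking every column in a downloaded file against the documentation that came with it. The sketch below assumes a hypothetical data dictionary and a small invented CSV file held in a string; a real data set would supply its own variable definitions.

```python
import csv
import io

# Hypothetical data dictionary: variable name -> definition from the documentation.
data_dictionary = {
    "branch": "Name of the library branch",
    "year": "Calendar year of the count",
    "circulation": "Number of items checked out",
}

# Invented CSV content standing in for a downloaded file.
csv_text = (
    "branch,year,circulation,notes\n"
    "Main,2023,15200,\n"
    "East,2023,8300,estimated\n"
)


def undocumented_columns(csv_file, dictionary):
    """List the column names in the file that have no definition in the dictionary."""
    reader = csv.DictReader(csv_file)
    return [name for name in reader.fieldnames if name not in dictionary]


missing = undocumented_columns(io.StringIO(csv_text), data_dictionary)
print("Columns without a definition:", missing)
```

Any column reported here is one whose meaning should be looked up before it is used in an analysis.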

### 3.4 Data management

Data management refers to the curation and preservation of data in a variety of contexts including the business world, public and private research institutes, and libraries. It is part of the data lifecycle, whose stages are the activities performed on data, such as collection, description, cleaning and validation, analysis, summarization, storage, and dissemination (Johnson, 2019). The research data lifecycle is a model of the typical activities, and the order of those activities, that a researcher might undertake as they progress through a research project. There are multiple versions of the research data lifecycle that use a variety of terms for each stage.

Librarians are somewhat unique in the roles that they play in data management. They may themselves be conducting research, working through the stages of the research data lifecycle. They may also be consulting with scholars, sometimes assisting them to find existing data sets, often assisting them to use ethical practices for storing, preserving, and disseminating data. In recent years, the latter role has come to be described as data services librarian. Data services librarians typically provide "services that address the full data lifecycle, including the data management plan, digital curation (selection, preservation, maintenance, and archiving), and metadata creation and conversion" (Tenopir et al., 2013) as well as "data mining, metadata knowledge, technical details of repository hardware and software, programming and software expertise, legal and policy knowledge, library instruction, and research consultations" (Fuhr, 2022). In this book we will focus on the librarian's role as researcher so that we may include in the term "librarian" information professionals working in a wide variety of libraries and information agencies. But the activities, experience, and knowledge gained from this book will be of as great use to data services librarians as they are to librarian researchers.

### 3.5 Data ethics

Researchers have ethical responsibilities not only to the human subjects of research, but also to their data, whether it is collected from human subjects or not. The Federal Data Ethics Tenets were created in the U.S. in the early 2020s to "help federal data users make decisions ethically and promote accountability throughout the data lifecycle—as data are acquired, processed, disseminated, used, stored and disposed. Regardless of data type or use, those working with data in the public sector should have a foundational understanding of the Data Ethics Tenets" (Office of Management and Budget, n.d., p. 2). The framework defines data ethics as follows: "Data Ethics are the norms of behavior that promote appropriate judgments and accountability when acquiring, managing, or using data, with the goals of protecting civil liberties, minimizing risks to individuals and society, and maximizing the public good" (Office of Management and Budget, n.d., p. 9). The tenets are:

1. Uphold Applicable Statutes, Regulations, Professional Practices, and Ethical Standards
2. Respect the Public, Individuals, and Communities
3. Respect Privacy and Confidentiality
4. Act with Honesty, Integrity, and Humility
5. Hold Oneself and Others Accountable
6. Promote Transparency
7. Stay Informed of Developments in the Fields of Data Management and Data Science (Office of Management and Budget, n.d., p. 10).

The tenets may seem somewhat familiar since they echo the principles upon which ethical human subjects research practice is based. They may also seem familiar to those who use Severson's (1997) *Principles of Information Ethics.* The practices associated with ethical behavior toward research data may vary across academic disciplines, but their importance is universal.

Some aspects of data ethics that are of special importance to data and information scientists are:
* Privacy: Protecting sensitive information and ensuring individuals have control over their own data (Pundhir, 2023).
* Fairness: Avoiding discrimination and ensuring that data-driven decisions are equitable for all groups ("Ethical considerations in data science," 2023).
* Transparency: Being open about how data is collected, processed, and used (Trivizakis & Marias, 2023).
* Bias: Identifying and mitigating biases in data and algorithms that can lead to unfair or inaccurate outcomes (Khanayat, 2023).
* Accountability: Establishing responsibility for the ethical implications of data-driven decisions (University of New South Wales, 2024).
* Consent: Obtaining informed consent before collecting or using personal data ("7 data ethics examples you must know in 2025," 2024).

In the context of this book and the exercises it includes, we will exercise ethical behavior toward the data we use specifically by respecting the rules for data scraping that websites require, by using data that does not impinge on the privacy rights of individuals, and by seeking to identify and mitigate bias we observe in the data and in our results.
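Many websites publish their scraping rules in a `robots.txt` file. As a minimal sketch of how those rules can be checked before any page is requested, the example below uses Python's standard `urllib.robotparser` module. The rules, the `example.org` URLs, and the `LibraryDataBot` user agent name are all hypothetical; a real project would read the site's actual `robots.txt` and would also review its terms of use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real site publishes its own rules
# at the /robots.txt path of its domain.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

# Parse the rules from the text above instead of fetching them over the network.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def may_fetch(url, user_agent="LibraryDataBot"):
    """Return True if the parsed rules allow this user agent to request the URL."""
    return parser.can_fetch(user_agent, url)


allowed = may_fetch("https://example.org/catalog/page1.html")
blocked = may_fetch("https://example.org/private/data.csv")

print("Catalog page allowed:", allowed)
print("Private file allowed:", blocked)
```

A scraper that calls a check like `may_fetch` before each request, and skips any URL for which it returns `False`, is following the site's published rules rather than assuming permission.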


### 3.6 Conclusion

In this chapter we have learned quite a lot about data. We have learned about kinds of data, numerical and non-numerical. We have learned about the special rules for collecting and protecting data collected from human subjects of research. We have learned about sources of data that already exist. We have learned about the necessity for managing data and planning for that management. We have learned about expectations that we will collect and use data following ethical best practices. What we have learned we will apply to working with data in later chapters.

### References
7 data ethics examples you must know in 2025. (2024, December 26). Atlan. https://atlan.com/data-ethics-examples/

American Community Survey. (2025). In *Wikipedia*. https://en.wikipedia.org/wiki/American_Community_Survey

DMP Tool. (2024). https://dmptool.org/about_us

Ethical considerations in data science: Privacy, bias, and fairness. (2023, August 7). *Medium.* https://skillfloor.medium.com/ethical-considerations-in-data-science-privacy-bias-and-fairness-d89b0bd12bd3

Flick, U. (2022). *An introduction to qualitative research* (7th ed.). Sage.

Fuhr, J. (2022). Developing data services skills in academic libraries. *College & Research Libraries, 83*(3). https://crl.acrl.org/index.php/crl/article/view/24863/33324

ICPSR. (2025). In *Wikipedia*. https://en.wikipedia.org/wiki/Inter-university_Consortium_for_Political_and_Social_Research

Khanayat, P. (2023, December 13). Ethics in data analytics: Key ethical guidelines. https://suozziforny.com/ethics-in-data-analytics

Office of Management and Budget. (n.d.). *Federal data strategy: Data ethics framework.* https://resources.data.gov/assets/documents/fds-data-ethics-framework.pdf

Pundhir, A. (2023, August 5). Data ethics: Navigating the moral landscape in the world of AI. *Medium.* https://ajaypundhir.medium.com/data-ethics-navigating-the-moral-landscape-of-data-science-1ba47a9584b3

Tenopir, C., Sandusky, R., Allard, S., & Birch, B. (2013). Academic librarians and research data services: Preparation and attitudes. *IFLA Journal, 39*(1). https://doi.org/10.1177/0340035212473089

Trivizakis, E., & Marias, K. (2023). Deep learning fundamentals. In M. E. Klontzas, S. C. Fanni, & E. Neri (Eds.), *Introduction to artificial intelligence* (pp. 101-131). Springer. https://doi.org/10.1007/978-3-031-25928-9_6

University of New South Wales. (2024, May 29). Data ethics: Examples, principles, and uses. UNSW Online. https://studyonline.unsw.edu.au/blog/data-ethics-overview

U.S. Census Bureau. (2023). *Using the American Community Survey Table-Based Summary File.* https://www.census.gov/content/dam/Census/library/publications/2023/acs/acs_table_based_summary_file_handbook.pdf

U.S. Department of Health and Human Services (HHS). (2021). *Lesson 3: What are IRBs?* Retrieved April 22, 2025, from https://www.hhs.gov/ohrp/education-and-outreach/online-education/human-research-protection-training/lesson-3-what-are-irbs/index.html
