## IF YOU CHOOSE THE WRONG VARIABLES TO STUDY, YOU MAY NOT END UP WITH RESULTS THAT SUPPORT MAKING BETTER DECISIONS.

### Defining variables

When a proper business problem or objective has been identified, you can begin to define your data. You define data by defining variables. You assign an operational definition to each variable you identify and specify the type of variable and the scale, or the type of measurement, the variable uses (the latter two concepts are discussed later in this section).

#### Example 1.1 - Defining Data at GT&M:

You have been hired by Good Tunes & More, a local electronics retailer, to assist in establishing a fair and reasonable price for Whitney Wireless, a privately-held chain that GT&M seeks to acquire. You need data that would help to analyze and verify the contents of the wireless company's basic financial statements. A GT&M manager suggests that one variable you should use is monthly sales. What do you do?

Solution:

Having first confirmed with the GT&M financial team that monthly sales is a relevant variable of interest, you develop an operational definition for this variable. Does this variable refer to sales per month for the entire chain or for individual stores? Does the variable refer to net or gross sales? Do the monthly sales data represent number of units sold or currency amounts? If the data are currency amounts, are they expressed in U.S. dollars? After getting answers to these and similar questions, you draft an operational definition for ratification by others working on this project.

#### Classifying variables by type

You need to know the type of data that a variable defines in order to choose statistical methods that are appropriate for that data. Broadly, all variables are either numerical, variables whose data represent a counted or measured quantity, or categorical, variables whose data represent categories. Gender with its categories male and female is a categorical variable, as is the variable preferred-New-Coke with its categories yes and no. In Example 1.1, the monthly sales variable is numerical because the data for this variable represent a quantity.

For some statistical methods, you must further specify numerical variables as either being discrete or continuous. Discrete numerical variables have data that arise from a counting process. Discrete numerical variables include variables that represent a “number of something,” such as the monthly number of smartphones sold in an electronics store. Continuous numerical variables have data that arise from a measuring process. The variable “the time spent waiting on a checkout line” is a continuous numerical variable because its data represent timing measurements. The data for a continuous variable can take on any value within a continuum or an interval, subject to the precision of the measuring instrument. For example, a waiting time could be 1 minute, 1.1 minutes, 1.11 minutes, or 1.113 minutes, depending on the precision of the electronic timing device used.

For some data, you might define a numerical variable for one problem that you wish to study, but define the same data as a categorical variable for another. For example, a person’s age might seem to always be a numerical variable, but what if you are interested in comparing the buying habits of children, young adults, middle-aged persons, and retirement-age people? In that case, defining age as a categorical variable would make better sense.

#### Measurement Scales

You identify the measurement scale that the data for a variable represent, as part of defining a variable. The measurement scale defines the ordering of values and determines if differences among pairs of values for a variable are equivalent and whether you can express one value in terms of another. Table 1.1 presents examples of measurement scales, some of which are used in the rest of this section.

You define numerical variables as using either an interval scale, an ordered scale in which the difference between measurements is meaningful but that does not include a true zero point, or a ratio scale, an ordered scale that includes a true zero point. If a numerical variable has a ratio scale, you can characterize one value in terms of another. You can say that an item that costs $2 is twice as expensive as an item that costs $1. However, because Fahrenheit temperatures use an interval scale, 2°F does not represent twice the heat of 1°F. For both interval and ratio scales, what the difference of 1 unit represents remains the same among pairs of values, so that the difference between $11 and $10 represents the same difference as the difference between $2 and $1 (and the difference between 11°F and 10°F represents the same as the difference between 2°F and 1°F).

Categorical variables use measurement scales that provide less insight into the values for the variable. For data measured on a nominal scale, category values express no order or ranking. For data measured on an ordinal scale, an ordering or ranking of category values is implied. Ordinal scales give you some information to compare values but not as much as interval or ratio scales. For example, the ordinal scale poor, fair, good, and excellent allows you to know that “good” is better than poor or fair and not better than excellent. But unlike interval and ratio scales, you do not know that the difference from poor to fair is the same as fair to good (or good to excellent).

### Collecting data

Collecting data using improper methods can spoil any statistical analysis. For example, Coca-Cola managers in the 1980s (see page 17) faced advertisements from their competitor publicizing the results of a “Pepsi Challenge” in which taste testers consistently favored Pepsi over Coke. No wonder—test recruiters deliberately selected tasters they thought would likely be more favorable to Pepsi and served samples of Pepsi chilled, while serving samples of Coke lukewarm (not a very fair comparison!). These introduced biases made the challenge anything but a proper scientific or statistical test. Proper data collection avoids introducing biases and minimizes errors.

### Populations and Samples

You collect your data from either a population or a sample. A population contains all the items or individuals of interest that you seek to study. All the GT&M sales transactions for a specific year, all the full-time students enrolled in a college, and all the registered voters in Ohio are examples of populations. A sample contains only a portion of a population of interest. You analyze a sample to estimate characteristics of an entire population. You might select a sample of 200 GT&M sales transactions, a sample of 50 full-time students selected for a marketing study, or a sample of 500 registered voters in Ohio in lieu of analyzing the populations identified in this paragraph.
You collect data from a sample when selecting a sample will be less time-consuming or less cumbersome than selecting every item in the population or when analyzing a sample is less cumbersome or more practical than analyzing the entire population. Section FTF.3 defines statistic as a “value that summarizes the data of a specific variable.” More precisely, a statistic summarizes the value of a specific variable for sample data. Correspondingly, a parameter summarizes the value of a specific variable for an entire population.
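
The distinction between a parameter and a statistic can be sketched in a few lines of Python. This is only an illustration: the transaction amounts are made-up values rather than GT&M data, and the seed is fixed solely so that the run is repeatable.

```python
import random
from statistics import mean

# Hypothetical population: 1,000 made-up transaction amounts in dollars.
rng = random.Random(1)
population = [round(rng.uniform(5, 500), 2) for _ in range(1000)]

# Parameter: summarizes the variable for the entire population.
population_mean = mean(population)

# Statistic: summarizes the same variable for a sample of 200 items.
sample = rng.sample(population, 200)
sample_mean = mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The two means will usually be close but not identical, which is why a statistic is treated as an estimate of the corresponding parameter.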


### Data Sources

Data sources arise from the following activities:

• Capturing data generated by ongoing business activities
• Distributing data compiled by an organization or individual
• Compiling the responses from a survey
• Conducting a designed experiment and recording the outcomes of the experiment
• Conducting an observational study and recording the results of the study

When you perform the activity that collects the data, you are using a primary data source. When the data collection part of these activities is done by someone else, you are using a secondary data source.
Capturing data can be done as a byproduct of an organization’s transactional information processing, such as the storing of sales transactions at a retailer such as GT&M, or as a result of a service provided by a second party, such as customer information that a social media website business collects on behalf of another business. Therefore, such data capture may be either a primary or a secondary source.

Typically, organizations such as market research firms and trade associations distribute compiled data, as do businesses that offer syndicated services, such as The Nielsen Company, known for its TV ratings. Therefore, this source of data is usually a secondary source. The other three sources are either primary or secondary, depending on whether you (your organization) are doing the activity. For example, if you oversee the distribution of a survey and the compilation of its results, the survey is a primary data source.

In both observational studies and designed experiments, researchers that collect data are looking for the effect of some change, called a treatment, on a variable of interest. In an observational study, the researcher collects data in a natural or neutral setting and has no direct control of the treatment. For example, in an observational study of the possible effects on theme park usage patterns (the variable of interest) that a new electronic payment method might cause, you would take a sample of visitors, identify those who use the new method and those who do not, and then “observe” if those who use the new method have different park usage patterns. In a designed experiment, you permit only those you select to use the new electronic payment method and then discover if the people you selected have different theme park usage patterns (from those who you did not select to use the new payment method). Choosing to use an observational study (or experiment) affects the statistical methods you apply and the decision-making processes that use the results of those methods, as later chapters (10, 11, and 17) will further explain.

### Types of Sampling Methods

When you collect data by selecting a sample, you begin by defining the frame. The frame is a complete or partial listing of the items that make up the population from which the sample will be selected. Inaccurate or biased results can occur if a frame excludes certain groups, or portions of the population. Using different frames to collect data can lead to different, even opposite, conclusions.
Using your frame, you select either a nonprobability sample or a probability sample. In a nonprobability sample, you select the items or individuals without knowing their probabilities of selection. In a probability sample, you select items based on known probabilities. Whenever possible, you should use a probability sample as such a sample will allow you to make inferences about the population being analyzed.
Nonprobability samples can have certain advantages, such as convenience, speed, and low cost. Such samples are typically used to obtain informal approximations or as small-scale initial or pilot analyses. However, because the theory of statistical inference depends on probability sampling, nonprobability samples cannot be used for statistical inference and this more than offsets those advantages in more formal analyses.
Figure 1.1 shows the subcategories of the two types of sampling. A nonprobability sample can be either a convenience sample or a judgment sample. To collect a convenience sample, you select items that are easy, inexpensive, or convenient to sample. For example, in a warehouse of stacked items, selecting only the items located on the top of each stack and within easy reach would create a convenience sample. So, too, would be the responses to surveys that the websites of many companies offer visitors. While such surveys can provide large amounts of data quickly and inexpensively, the convenience samples selected from these responses will consist of self-selected website visitors. (Read the Consider This essay on page 29 for a related story.)
To collect a judgment sample, you collect the opinions of preselected experts in the subject-matter. Although the experts may be well-informed, you cannot generalize their results to the population.
The types of probability samples most commonly used include simple random, systematic, stratified, and cluster samples. These four types of probability samples vary in terms of cost, accuracy, and complexity, and they are the subject of the rest of this section.

### Simple Random Sample

In a simple random sample, every item from a frame has the same chance of selection as every other item, and every sample of a fixed size has the same chance of selection as every other sample of that size. Simple random sampling is the most elementary random sampling technique. It forms the basis for the other random sampling techniques. However, simple random sampling has its disadvantages. Its results are often subject to more variation than other sampling methods. In addition, when the frame used is very large, carrying out a simple random sample may be time consuming and expensive.

With simple random sampling, you use n to represent the sample size and N to represent the frame size. You number every item in the frame from 1 to N. The chance that you will select any particular member of the frame on the first selection is 1/N.
You select samples with replacement or without replacement. Sampling with replacement means that after you select an item, you return it to the frame, where it has the same probability of being selected again. Imagine that you have a fishbowl containing N business cards, one card for each person. On the first selection, you select the card for Grace Kim. You record pertinent information and replace the business card in the bowl. You then mix up the cards in the bowl and select a second card. On the second selection, Grace Kim has the same probability of being selected again, 1/N. You repeat this process until you have selected the desired sample size, n.
Typically, you do not want the same item or individual to be selected again in a sample. Sampling without replacement means that once you select an item, you cannot select it again. The chance that you will select any particular item in the frame—for example, the business card for Grace Kim—on the first selection is 1/N. The chance that you will select any card not previously chosen on the second selection is now 1 out of N - 1. This process continues until you have selected the desired sample of size n.
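
The difference between the two approaches can be sketched with Python's standard `random` module. The frame size N = 800 and sample size n = 40 are assumed values chosen only for illustration, and the seed is fixed so that the run is repeatable.

```python
import random

rng = random.Random(42)          # fixed seed so the run is repeatable
frame = list(range(1, 801))      # items numbered 1 to N, with N = 800 (assumed)
n = 40                           # desired sample size (assumed)

# With replacement: an item can be selected more than once.
with_replacement = rng.choices(frame, k=n)

# Without replacement: each item can be selected at most once.
without_replacement = rng.sample(frame, n)

print(len(set(with_replacement)), "distinct items when sampling with replacement")
print(len(set(without_replacement)), "distinct items when sampling without replacement")
```

Sampling without replacement always yields n distinct items, whereas sampling with replacement may yield fewer because repeats are allowed.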

When creating a simple random sample, you should avoid the “fishbowl” method of selecting a sample because this method lacks the ability to thoroughly mix the cards and, therefore, randomly select a sample. You should use a more rigorous selection method.
One such method is to use a table of random numbers, such as Table E.1 in Appendix E, for selecting the sample. A table of random numbers consists of a series of digits listed in a randomly generated sequence. To use a random number table for selecting a sample, you first need to assign code numbers to the individual items of the frame. Then you generate the random sample by reading the table of random numbers and selecting those individuals from the frame whose assigned code numbers match the digits found in the table. Because the number system uses 10 digits (0, 1, 2, ... , 9), the chance that you will randomly generate any particular digit is equal to the probability of generating any other digit. This probability is 1 out of 10. Hence, if you generate a sequence of 800 digits, you would expect about 80 to be the digit 0, 80 to be the digit 1, and so on. Because every digit or sequence of digits in the table is random, the table can be read either horizontally or vertically. The margins of the table designate row numbers and column numbers. The digits themselves are grouped into sequences of five in order to make reading the table easier.
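
The table-reading procedure can be imitated in Python. In this sketch a pseudo-random digit generator stands in for a printed table such as Table E.1, and the function name `select_by_random_digits` is hypothetical. Discarding codes that fall outside the frame or that have already been selected is an assumed convention for sampling without replacement, not a rule stated above.

```python
import random

def select_by_random_digits(frame_size, sample_size, code_width, rng):
    """Read random digits in groups of code_width and keep usable codes."""
    selected = []
    while len(selected) < sample_size:
        # Stand-in for reading the next code_width digits from a printed table.
        digits = [rng.randrange(10) for _ in range(code_width)]
        code = int("".join(str(d) for d in digits))
        # Discard codes outside 1..frame_size and codes already selected.
        if 1 <= code <= frame_size and code not in selected:
            selected.append(code)
    return selected

rng = random.Random(7)
sample_codes = select_by_random_digits(
    frame_size=800, sample_size=40, code_width=3, rng=rng
)
print(sorted(sample_codes))
```

With a frame of 800 items, three-digit codes are needed, so codes 000 and 801 through 999 are simply skipped when they come up.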

### Systematic Sample

In a systematic sample, you partition the N items in the frame into n groups of k items, where
k = N/n

You round k to the nearest integer. To select a systematic sample, you choose the first item to be selected at random from the first k items in the frame. Then, you select the remaining n - 1 items by taking every kth item thereafter from the entire frame.
If the frame consists of a list of pre-numbered checks, sales receipts, or invoices, taking a systematic sample is faster and easier than taking a simple random sample. A systematic sample is also a convenient mechanism for collecting data from membership directories, electoral registers, class rosters, and consecutive items coming off an assembly line.

To take a systematic sample of n = 40 from the population of N = 800 full-time employees, you partition the frame of 800 into 40 groups, each of which contains 20 employees. You then select a random number from the first 20 individuals and include every twentieth individual after the first selection in the sample. For example, if the first random number you select is 008, your subsequent selections are 028, 048, 068, 088, 108, ... , 768, and 788.
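
The selection rule in this example can be written as a short Python sketch. The function name `systematic_sample` is hypothetical, and the sketch assumes that N is an exact multiple of n; if k had to be rounded up, the last selections could fall beyond the end of the frame.

```python
import random

def systematic_sample(frame_size, sample_size, start=None, rng=random):
    """Return item numbers (1-based) chosen by systematic sampling."""
    k = round(frame_size / sample_size)      # k = N/n, rounded to an integer
    if start is None:
        start = rng.randint(1, k)            # random start among the first k items
    # Take the starting item and every kth item after it.
    return [start + i * k for i in range(sample_size)]

# Reproduce the example: N = 800, n = 40, first selection 008.
chosen = systematic_sample(frame_size=800, sample_size=40, start=8)
print(chosen[:5], "...", chosen[-2:])
```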
Simple random sampling and systematic sampling are simpler than other, more sophisticated, probability sampling methods, but they generally require a larger sample size. In addition, systematic sampling is prone to selection bias that can occur when there is a pattern in the frame. To overcome the inefficiency of simple random sampling and the potential selection bias involved with systematic sampling, you can use either stratified sampling methods or cluster sampling methods.

### Stratified Sample

In a stratified sample, you first subdivide the N items in the frame into separate subpopulations, or strata. A stratum is defined by some common characteristic, such as gender or year in school. You select a simple random sample within each of the strata and combine the results from the separate simple random samples. Stratified sampling is more efficient than either simple random sampling or systematic sampling because you are ensured of the representation of items across the entire population. The homogeneity of items within each stratum provides greater precision in the estimates of underlying population parameters. In addition, stratified sampling enables you to reach conclusions about each stratum in the frame. However, using a stratified sample requires that you determine the variable(s) on which to base the stratification, and it can also be expensive to implement.
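
A stratified sample might be drawn as in the following Python sketch. The frame of student records is made up, the function name `stratified_sample` is hypothetical, and taking the same fraction from every stratum (proportional allocation) is an assumption; the text does not prescribe how many items to take from each stratum.

```python
import random
from collections import defaultdict

def stratified_sample(frame, stratum_of, fraction, rng):
    """Take a simple random sample of the same fraction from every stratum."""
    strata = defaultdict(list)
    for item in frame:
        strata[stratum_of(item)].append(item)

    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))   # at least one per stratum
        sample.extend(rng.sample(members, size))        # simple random sample
    return sample

# Hypothetical frame: (student id, year in school) pairs, 100 per year.
rng = random.Random(3)
years = ["freshman", "sophomore", "junior", "senior"]
frame = [(i, years[i % 4]) for i in range(1, 401)]

sample = stratified_sample(frame, stratum_of=lambda s: s[1], fraction=0.1, rng=rng)
counts = {y: sum(1 for _, year in sample if year == y) for y in years}
print(counts)
```

Because each stratum is sampled separately, every year in school is guaranteed to appear in the combined sample.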

### Cluster Sample

In a cluster sample, you divide the N items in the frame into clusters that contain several items. Clusters are often naturally occurring groups, such as counties, election districts, city blocks, households, or sales territories. You then take a random sample of one or more clusters and study all items in each selected cluster.
Cluster sampling is often more cost-effective than simple random sampling, particularly if the population is spread over a wide geographic region. However, cluster sampling often requires a larger sample size to produce results as precise as those from simple random sampling or stratified sampling. A detailed discussion of systematic sampling, stratified sampling, and cluster sampling procedures can be found in references 2, 4, and 6.
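
The two steps of cluster sampling, randomly choosing clusters and then studying every item in them, can be sketched in Python. The sales territories and customer labels are made-up, and the function name `cluster_sample` is hypothetical.

```python
import random

def cluster_sample(clusters, clusters_to_select, rng):
    """Randomly pick whole clusters and return every item in them."""
    chosen_names = rng.sample(sorted(clusters), clusters_to_select)
    items = [item for name in chosen_names for item in clusters[name]]
    return chosen_names, items

# Hypothetical frame: 10 sales territories with 25 customers each.
clusters = {
    f"territory_{t}": [f"T{t}-C{c}" for c in range(1, 26)]
    for t in range(1, 11)
}

rng = random.Random(5)
chosen_names, items = cluster_sample(clusters, clusters_to_select=3, rng=rng)
print(chosen_names, len(items))
```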

### Data Cleaning

Even if you follow proper procedures to collect data, the data you collect may contain incorrect or inconsistent values that could affect statistical results. Data cleaning corrects such defects and ensures that your data are of suitable quality for your needs. Cleaning is the most important data preprocessing task you do and must be done before using your data for analysis. Cleaning can take a significant amount of time to do. One survey of big data analysts reported that they spend 60% of their time cleaning data, while only 20% of their time collecting data and a similar percentage for analyzing data (see reference 8).
Data cleaning seeks to correct the following types of irregularities:
• Invalid variable values, including
  § Non-numerical data for a numerical variable
  § Invalid categorical values of a categorical variable
  § Numeric values outside a defined range
• Coding errors, including
  § Inconsistent categorical values
  § Inconsistent case for categorical values
  § Extraneous characters
• Data integration errors, including
  § Redundant columns
  § Duplicated rows
  § Differing column lengths
  § Different units of measure or scale for numerical variables
By its nature, data cleaning cannot be a fully automated process, even in large business systems that contain data cleaning software components. As this chapter’s software guides explain, Excel, JMP, and Minitab have functionality that you can use to lessen the burden of data cleaning. When performing data cleaning, you always first preserve a copy of the original data for later reference.

### Invalid Variable Values

Invalid variable values can be identified as being incorrect by simple scanning techniques so long as operational definitions for the variables the data represent exist. For any numerical variable, any value that is not a number is clearly an incorrect value. For a categorical variable, a value that does not match any of the predefined categories of the variable is, likewise, clearly an incorrect value. And for numerical variables defined with an explicit range of values, a value outside that range is clearly an error.
You will most likely semi-automate the finding of invalid variable values and can use various features of Excel, JMP, or Minitab to assist you in this task.
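
As one illustration of such a semi-automated scan, the following Python sketch checks each value against an operational definition. The variable names, the allowed range of 0 to 120 for Age, and the function name `find_invalid_values` are all assumptions made for the example; this is not how Excel, JMP, or Minitab perform the task.

```python
def find_invalid_values(rows, numeric_ranges, categories):
    """Return (row index, variable, value) for every invalid value found."""
    problems = []
    for index, row in enumerate(rows):
        for name, (low, high) in numeric_ranges.items():
            try:
                number = float(row[name])
            except (TypeError, ValueError):
                problems.append((index, name, row[name]))   # not a number
                continue
            if not low <= number <= high:
                problems.append((index, name, row[name]))   # outside the range
        for name, allowed in categories.items():
            if row[name] not in allowed:
                problems.append((index, name, row[name]))   # unknown category
    return problems

rows = [
    {"Age": "34", "Gender": "F"},
    {"Age": "abc", "Gender": "M"},
    {"Age": "212", "Gender": "M"},
    {"Age": "51", "Gender": "New York"},
]
problems = find_invalid_values(
    rows,
    numeric_ranges={"Age": (0, 120)},
    categories={"Gender": {"F", "M"}},
)
print(problems)
```

The scan only reports suspect values. Deciding what to do with each one remains a manual step.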

### Coding Errors

Coding errors can result from poor recording or entry of data values or as the result of computerized operations such as copy-and-paste or data import. While coding errors are literally invalid values, coding errors may be correctable without consulting additional information whereas the invalid variable values never are. For example, for a Gender variable with the defined values F and M, the value “Female” is a coding error that can be reasonably changed to F. However, the value “New York” for the same variable is an invalid variable value that you cannot reasonably change to either F or M.
Unlike invalid variable values, coding errors may be tolerated by analysis software. For example, for the same Gender variable, the values M and m might be treated as the “same” value for purposes of an analysis by software that was tolerant of case inconsistencies, an attribute known as being insensitive to case.
Perhaps the most frustrating coding errors are extraneous characters in a value. You may not be able to spot extraneous characters such as non-printing characters or extra, trailing space characters as you scan data. For example, the value David and the value that is David followed by three space characters may look the same to you as you scan data but may not be treated the same by software. Likewise, values with non-printing characters may look correct but cause software errors or be reported as invalid by analysis software.
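
The following Python sketch shows one way to correct the coding errors described in this section for a Gender variable defined with the values F and M. The table of accepted spellings and the function name `clean_gender` are hypothetical. Values that cannot be mapped, such as “New York”, are returned as `None` so that they can be reviewed as invalid values.

```python
# Hypothetical recoding table for a Gender variable defined with values F and M.
KNOWN_SPELLINGS = {"f": "F", "female": "F", "m": "M", "male": "M"}

def clean_gender(raw):
    """Return F or M for a correctable coding error, else None."""
    # Keep only printable characters, then drop leading and trailing spaces.
    text = "".join(ch for ch in raw if ch.isprintable()).strip()
    # Compare in lowercase so that the match is insensitive to case.
    return KNOWN_SPELLINGS.get(text.lower())

raw_values = ["F", "m", "Female", "M   ", "\tmale\u200b", "New York"]
cleaned = [clean_gender(value) for value in raw_values]
print(cleaned)
```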

### Data Integration Errors

Data integration errors arise when data from two different computerized sources, such as two different data repositories, are combined into one data set for analysis. Identifying data integration errors may be the most time-consuming data cleaning task. Because spotting these errors requires a type of data interpretation that the automated processes of typical business computer systems today cannot supply, you will most likely be spotting these errors using manual means in the foreseeable future.
Some data integration errors occur because variable names or definitions for the same item of interest have minor differences across systems. In one system, a customer ID number may be known as Customer ID, whereas in a different system, the same fact is known as Cust Number. Combining data from the two systems may result in having both Customer ID and Cust Number variable columns, a redundancy that should be eliminated.
Duplicated rows also occur because of similar inconsistencies across systems. Consider a Customer Name variable with the value that represents the first coauthor of this book, Mark L. Berenson. In one system, this name may have been recorded as Mark Berenson, whereas in another system, the name was recorded as M L Berenson. Combining records from both systems may result in two records, where only one should exist. Whether “Mark Berenson” is actually the same person as “M L Berenson” requires an interpretation skill that today’s software may lack.
Likewise, different units of measurement (or scale) may not be obvious without additional, human interpretation. Consider the variable Air Temperature, recorded in degrees Celsius in one system and degrees Fahrenheit in another. The value 30 would be a plausible value under either measurement system and, without further knowledge or context, would be impossible to spot as a Celsius measurement in a column of otherwise Fahrenheit measurements.

### Missing Values

Missing values are values that were not collected for a variable. For example, survey data may include answers for which no response was given by the survey taker. Such “no responses” are examples of missing values. Missing values can also result from integrating two data sources that do not have a row-to-row correspondence for each row in both sources. The lack of correspondence causes particular variable columns to be longer than the other columns, that is, to contain additional rows. For these additional rows, missing would be the value for the cells in the shorter columns.
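
The effect of combining columns of unequal length can be sketched in Python with `itertools.zip_longest`, which pads the shorter column. The two columns are made-up values, and using `None` as the missing-value marker is an assumption; analysis software may use a different marker.

```python
from itertools import zip_longest

# Two hypothetical columns of unequal length taken from different sources.
customer_ids = [101, 102, 103, 104, 105]
survey_scores = [4, 5, 3]

# Pad the shorter column with None so that every row has either a value
# or a recognizable missing-value marker for each variable.
rows = list(zip_longest(customer_ids, survey_scores, fillvalue=None))
missing_count = sum(1 for _, score in rows if score is None)

print(rows)
print("missing survey scores:", missing_count)
```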
Do not confuse missing values with miscoded values. Unresolved miscoded values, values that cannot be cleaned by any method, might be changed to missing by some researchers or excluded from analysis by others.

### Algorithmic Cleaning of Extreme Numerical Values

For numerical variables without a defined range of possible values, you might find outliers, values that seem excessively different from most of the other values. Such values may or may not be errors, but all outliers require review. While there is no one standard for defining outliers, most define outliers in terms of descriptive measures such as the standard deviation or the interquartile range that Chapter 3 discusses. Because software can compute such measures, spotting outliers can be automated if a definition of the term that uses such a measure is used. As later chapters note as appropriate, identifying outliers is important as some methods are sensitive to outliers and produce very different results when outliers are included in analysis.
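
As an illustration of automating the search, the following Python sketch flags values that lie more than 1.5 times the interquartile range below the first quartile or above the third quartile. That multiplier is a widely used convention adopted here as an assumption, not a definition taken from this text. The meal costs are made-up, and the function name `iqr_outliers` is hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, multiplier=1.5):
    """Return the values lying more than multiplier * IQR outside Q1..Q3."""
    q1, _, q3 = quantiles(values, n=4)       # quartiles (default "exclusive" method)
    iqr = q3 - q1
    low, high = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [v for v in values if v < low or v > high]

meal_costs = [22, 25, 27, 28, 30, 31, 33, 35, 36, 38, 40, 145]   # made-up data
flagged = iqr_outliers(meal_costs)
print(flagged)
```

A flagged value is a candidate for review, not automatically an error. Software packages compute quartiles in slightly different ways, so borderline values may be flagged by one package and not by another.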

## Other Data Preprocessing Tasks

In addition to data cleaning, there are several other data preprocessing tasks that you might undertake before visualizing and analyzing your data.

### Data Formatting

You may need to reformat your data when you collect your data. Reformatting can mean rearranging the structure of the data or changing the electronic encoding of the data or both. For example, suppose that you seek to collect financial data about a sample of companies. You might find these data structured as tables of data, as the contents of standard forms, in a continuous stock ticker stream, or as messages or blog entries that appear on various websites. These data sources have various levels of structure which affect the ease of reformatting them for use.

Because tables of data are highly structured and are similar to the structure of a worksheet, tables would require the least reformatting. In the best case, you could make the rows and columns of a table the rows and columns of a worksheet. Unstructured data sources, such as messages and blog entries, often represent the worst case. You may have to paraphrase or characterize the message contents in a way that does not involve a direct transfer. As the use of business analytics grows (see Chapter 17), the use of automated ways to paraphrase or characterize these and other types of unstructured data grows, too.

Independent of the structure, the data you collect may exist in an electronic form that needs to be changed in order to be analyzed. For example, data presented as a digital picture of Excel worksheets would need to be changed into an actual Excel worksheet before that data could be analyzed. In this example, you are changing the electronic encoding of all the data, from a picture format such as jpeg to an Excel workbook format such as xlsx. Sometimes, individual numerical values that you have collected may need to be changed, especially if you collect values that result from a computational process. You can demonstrate this issue in Excel by entering a formula that is equivalent to the expression 1 * (0.5 - 0.4 - 0.1), which should evaluate as 0 but in Excel evaluates to a very small negative number. You would want to alter that value to 0 as part of your data cleaning.
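
The same behavior can be reproduced outside Excel, because it comes from binary floating-point arithmetic rather than from any one program. The following Python sketch evaluates the expression and then replaces a result that is negligibly different from zero. The tolerance of 1e-12 is an assumed value; an appropriate tolerance depends on the scale of your data.

```python
result = 1 * (0.5 - 0.4 - 0.1)
print(result)                 # a tiny negative number, not exactly 0

# Treat values within a small tolerance of zero as zero.
TOLERANCE = 1e-12             # assumed tolerance; choose one suited to your data
cleaned = 0.0 if abs(result) < TOLERANCE else result
print(cleaned)
```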

### Stacking and Unstacking Data

When collecting data for a numerical variable, you may need to subdivide that data into two or more groups for analysis. For example, if you were collecting data about the cost of a restaurant meal in an urban area, you might want to consider the cost of meals at restaurants in the center city district separately from the meal costs at metro area restaurants. When you want to consider two or more groups, you can arrange your data as either unstacked or stacked.

To use an unstacked arrangement, you create separate numerical variables for each group. In the example, you would create a center city meal cost variable and a second variable to hold the meal costs at metro area restaurants. To use a stacked arrangement format, you pair the single numerical variable meal cost with a second, categorical variable that contains two categories, such as center city and metro area. If you collect several numerical variables, each of which you want to subdivide in the same way, stacking your data will be the more efficient choice for you.
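
The two arrangements, and the conversion between them, can be sketched in Python. The meal costs are made-up values used only for illustration.

```python
# Unstacked: one numerical variable per group (made-up meal costs in dollars).
unstacked = {
    "center city": [62, 48, 55],
    "metro area": [41, 39, 52, 44],
}

# Stacked: one categorical variable (location) paired with one
# numerical variable (meal cost).
stacked = [
    (location, cost)
    for location, costs in unstacked.items()
    for cost in costs
]

# Unstacking again groups the costs by the value of the categorical variable.
restored = {}
for location, cost in stacked:
    restored.setdefault(location, []).append(cost)

print(stacked)
print(restored == unstacked)
```

Note that the unstacked groups need not contain the same number of values, whereas the stacked arrangement always has one row per observation.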

When you use software to analyze data, you may discover that a particular procedure requires data to be stacked (or unstacked). When such cases arise using Microsoft Excel, JMP, or Minitab for problems or examples that this book discusses, a workbook or project will contain that data in both arrangements. For example, Restaurants, which Chapter 2 uses for several examples, contains both the original (stacked) data about restaurants as well as an unstacked worksheet (or data table) that contains the meal cost by location, center city or metro area.

### Recoding Variables

After you have collected data, you may need to reconsider the categories that you defined for a categorical variable or transform a numerical variable into a categorical variable by assigning the individual numeric values to one of several groups. In either case, you can define a recoded variable that supplements or replaces the original variable in your analysis.

For example, having already defined the variable class standing with the categories freshman, sophomore, junior, and senior, you decide that you want to investigate the differences between lowerclassmen (freshmen or sophomores) and upperclassmen (juniors or seniors). You can define a recoded variable UpperLower and assign the value Upper if a student is a junior or senior and assign the value Lower if the student is a freshman or sophomore.

When recoding variables, make sure that one and only one of the new categories can be assigned to any particular value being recoded and that each value can be recoded successfully by one of your new categories. You must ensure that your recoding has these properties of being mutually exclusive and collectively exhaustive.

When recoding numerical variables, pay particular attention to the operational definitions of the categories you create for the recoded variable, especially if the categories are not self-defining ranges. For example, while the recoded categories Under 12, 12–20, 21–34, 35–54, and 55-and-over are self-defining for age, the categories child, youth, young adult, middle aged, and senior each need to be further defined in terms of mutually exclusive and collectively exhaustive numerical ranges.
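
The age recoding above might be implemented as in the following Python sketch, which assumes that ages are recorded as non-negative whole numbers. The function name `recode_age` is hypothetical. Because the cut points split the whole range of ages into adjacent, non-overlapping intervals, every age receives exactly one category, which is what being mutually exclusive and collectively exhaustive requires.

```python
from bisect import bisect_right

# Lower bounds of the categories 12–20, 21–34, 35–54, and 55-and-over.
CUT_POINTS = [12, 21, 35, 55]
LABELS = ["Under 12", "12–20", "21–34", "35–54", "55-and-over"]

def recode_age(age):
    """Assign a whole-number age to exactly one category."""
    if age < 0:
        raise ValueError("age cannot be negative")
    # bisect_right counts how many cut points are less than or equal to age.
    return LABELS[bisect_right(CUT_POINTS, age)]

ages = [3, 11, 12, 20, 21, 34, 35, 54, 55, 90]
recoded = [recode_age(a) for a in ages]
print(list(zip(ages, recoded)))
```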


## Types of Survey Errors

When you collect data using the compiled responses from a survey, you must verify two things about the survey in order to make sure you have results that can be used in a decision-making process. You must evaluate the validity of the survey to make sure the survey does not lack objectivity or credibility. To do this, you evaluate the purpose of the survey, the reason the survey was conducted, and for whom the survey was conducted.

Having validated the objectivity and credibility of the survey, you then determine whether the survey was based on a probability sample (see Section 1.3). Surveys that use nonprobability samples are subject to serious biases that make their results useless for decision-making purposes. In the case of the Coca-Cola managers concerned about the “Pepsi Challenge” results (see page 17), the managers failed to reflect on the subjective nature of the challenge as well as the nonprobability sample that this survey used. Had the managers done so, they might not have been so quick to make the reformulation blunder that was reversed just weeks later.

Even when you verify these two things, surveys can suffer from any combination of the following types of survey errors: coverage error, nonresponse error, sampling error, or measurement error. Developers of well-designed surveys seek to reduce or minimize these types of errors, often at considerable cost.

### Coverage Error

The key to proper sample selection is having an adequate frame. Coverage error occurs if certain groups of items are excluded from the frame so that they have no chance of being selected in the sample or if items are included from outside the frame. Coverage error results in a selection bias. If the frame is inadequate because certain groups of items in the population were not properly included, any probability sample selected will provide only an estimate of the characteristics of the frame, not the actual population.


### Nonresponse Error

Not everyone is willing to respond to a survey. Nonresponse error arises from failure to collect data on all items in the sample and results in a nonresponse bias. Because you cannot always assume that persons who do not respond to surveys are similar to those who do, you need to follow up on the nonresponses after a specified period of time. You should make several attempts to convince such individuals to complete the survey and possibly offer an incentive to participate. The follow-up responses are then compared to the initial responses in order to make valid inferences from the survey (see references 2, 4, and 6). The mode of response you use, such as face-to-face interview, telephone interview, paper questionnaire, or computerized questionnaire, affects the rate of response. Personal interviews and telephone interviews usually produce a higher response rate than do mail surveys, but at a higher cost.

### Sampling Error

When conducting a probability sample, chance dictates which individuals or items will or will not be included in the sample. Sampling error reflects the variation, or “chance differences,” from sample to sample, based on the probability of particular individuals or items being selected in the particular samples.

When you read about the results of surveys or polls in newspapers or on the Internet, there is often a statement regarding a margin of error, such as “the results of this poll are expected to be within ±4 percentage points of the actual value.” This margin of error is the sampling error. You can reduce sampling error by using larger sample sizes. Of course, doing so increases the cost of conducting the survey.
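To see how sample size affects the margin of error, consider the following Python sketch. It relies on assumptions that this section does not state: a simple random sample, a 95% confidence level (z = 1.96), the usual large-sample formula for a proportion, and the most conservative case of p = 0.5. The function name `margin_of_error` is illustrative.

```python
import math

Z_95 = 1.96  # assumed critical value for 95% confidence

def margin_of_error(n, p=0.5, z=Z_95):
    """Approximate margin of error for a sample proportion.

    Uses z * sqrt(p * (1 - p) / n), which assumes a simple random
    sample that is large enough for the normal approximation.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    return z * math.sqrt(p * (1 - p) / n)

for n in (150, 600, 2400):
    print(n, round(100 * margin_of_error(n), 1), "percentage points")
```

Under these assumptions, a sample of 600 gives a margin of about 4 percentage points, and quadrupling the sample size only halves the margin. This is why reducing sampling error can become costly.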

### Measurement Error

In the practice of good survey research, you design surveys with the intention of gathering meaningful and accurate information. Unfortunately, the survey results you get are often only a proxy for the ones you really desire. Unlike height or weight, certain information about behaviors and psychological states is impossible or impractical to obtain directly.

When surveys rely on self-reported information, the mode of data collection, the respondent to the survey, and/or the survey itself can be possible sources of measurement error. Satisficing, social desirability, reading ability, and/or interviewer effects can be dependent on the mode of data collection. The social desirability bias or cognitive/memory limitations of a respondent can affect the results. And vague questions, double-barreled questions that ask about multiple issues but require a single response, or questions that ask the respondent to report something that occurs over time but fail to clearly define the extent of time about which the question asks (the reference period) are some of the survey flaws that can cause errors.

To minimize measurement error, you need to standardize survey administration and respondent understanding of questions, but there are many barriers to this (see references 1, 3, and 12).

### Ethical Issues About Surveys

Ethical considerations arise with respect to the four types of survey error. Coverage error can result in selection bias and becomes an ethical issue if particular groups or individuals are purposely excluded from the frame so that the survey results are more favorable to the survey’s sponsor. Nonresponse error can lead to nonresponse bias and becomes an ethical issue if the sponsor knowingly designs the survey so that particular groups or individuals are less likely than others to respond. Sampling error becomes an ethical issue if the findings are purposely presented without reference to sample size and margin of error so that the sponsor can promote a viewpoint that might otherwise be inappropriate. Measurement error can become an ethical issue in one of three ways: (1) a survey sponsor chooses leading questions that guide the respondent in a particular direction; (2) an interviewer, through mannerisms and tone, purposely makes a respondent feel obligated to please the interviewer or otherwise guides the respondent in a particular direction; or (3) a respondent willfully provides false information.

Ethical issues also arise when the results of nonprobability samples are used to form conclusions about the entire population. When you use a nonprobability sampling method, you need to explain the sampling procedures and state that the results cannot be generalized beyond the sample.
