**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update the relevant sections where you lost those points, and make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Peng Yuntong
- Richard Rangel
- Muhammad Omer
- Matthew Palmer
- Shijun Li

# Research Question

-  Include a specific, clear data science question.
-  Make sure what you're measuring (variables) to answer the question is clear

What is your research question? Include the specific question you're setting out to answer. This question should be specific, answerable with data, and clear. A general question with specific subquestions is permitted. (1-2 sentences)



## Background and Prior Work


The San Diego housing market has undergone significant transformations in recent years, with investors playing an increasingly prominent role. Traditional economic indicators such as employment rates, interest rates, and median household income have historically been reliable predictors of housing market movements. However, the growing presence of investors has introduced new dynamics that may affect market behavior in ways not fully captured by conventional metrics.

Recent analysis by Redfin (2024) highlights the substantial impact of investor activity in San Diego's housing market. Their research indicates that in the second quarter of 2024, investors purchased 23.7% of homes sold in San Diego, positioning the city as the second-highest in the U.S. for investor home purchases, following Miami at 28.5%. This level of investor activity represents a significant portion of market transactions and potentially influences local market dynamics. The study provides crucial baseline data about investor purchase volumes but leaves open questions about the relationship between investor activity and price movements at the zip code level.[1]

The impact of investors on local housing markets extends beyond simple purchase statistics. As reported by CBS 8 San Diego (2024), local officials and real estate experts have expressed both concern and optimism about investor activity in the market. While increased investor presence may signal long-term market strength, it also raises concerns about competition with first-time homebuyers and potential market distortions. San Diego County Supervisor Terra Lawson-Remer has highlighted the important distinction between local "mom-and-pop" landlords and large institutional investors, noting that the latter may have different impacts on community stability and housing affordability.[2]

This research aims to build upon these findings by examining the specific relationship between investor activity and housing price movements at the zip code level, providing a more granular understanding of how different types of investor activity correlate with local market dynamics. By comparing rates of change in investor purchase percentages with rates of change in home purchase prices by zip code, we can better understand whether areas with higher investment activity experience accelerated price appreciation compared to areas with less investor presence.
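The comparison described above could be sketched roughly as follows. This is only an illustration under assumptions: the record layout `(zip_code, period, investor_share, median_price)`, the helper names `change_pairs` and `pearson`, and all of the numbers are hypothetical and do not come from our datasets. It uses only the standard library; in the notebook itself a pandas `groupby` would likely do the same job.

```python
from collections import defaultdict
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def change_pairs(rows):
    """Turn (zip_code, period, investor_share, median_price) rows into
    period-over-period (change in investor share, price growth) pairs,
    computed separately within each zip code."""
    by_zip = defaultdict(list)
    for zip_code, period, share, price in rows:
        by_zip[zip_code].append((period, share, price))
    pairs = []
    for series in by_zip.values():
        series.sort()  # order each zip code's records by period
        for (_, s0, p0), (_, s1, p1) in zip(series, series[1:]):
            pairs.append((s1 - s0, (p1 - p0) / p0))
    return pairs

# Made-up numbers, for illustration only
rows = [
    ("92101", 1, 0.20, 800_000), ("92101", 2, 0.24, 840_000), ("92101", 3, 0.25, 848_400),
    ("92115", 1, 0.10, 600_000), ("92115", 2, 0.10, 603_000), ("92115", 3, 0.13, 621_090),
]
pairs = change_pairs(rows)
r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
print(f"{len(pairs)} change pairs, correlation = {r:.2f}")
```

A correlation computed this way only describes association between the two rates of change; it does not by itself show that investor activity causes price appreciation.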

References:

[1] Chen, S. (2024). "Investor Home Purchases Post Biggest Increase in Two Years." Redfin News. https://www.redfin.com/news/investor-home-purchases-q2-2024/

[2] Perez, E. (2024). "Study: San Diego is #2 in U.S. for homes bought by investors." CBS 8 San Diego. https://www.cbs8.com/article/news/local/san-diego-top-for-homes-bought-by-investors/509-759b8e51-fafb-41f7-aced-e89da57670d1


# Hypothesis



- Include your team's hypothesis
- Ensure that this hypothesis is clear to readers
- Explain why you think this will be the outcome (what was your thinking?)

What is your main hypothesis/predictions about what the answer to your question is? Briefly explain your thinking. (2-3 sentences)

# Data

## Data overview

For each dataset include the following information
- Dataset #1
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
- Dataset #2 (if you have more than one!)
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
- etc

Now write 2 - 5 sentences describing each dataset here. Include a short description of the important variables in the dataset; what the metrics and datatypes are, what concepts they may be proxies for. Include information about how you would need to wrangle/clean/preprocess the dataset

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.

## Dataset #1 (use name instead of number here)

In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION 

## Dataset #2 (if you have more than one, use name instead of number here)

In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION 

# Ethics & Privacy

- Thoughtful discussion of ethical concerns included
- Ethical concerns consider the whole data science process (question asked, data collected, data being used, the bias in data, analysis, post-analysis, etc.)
- How your group handled bias/ethical concerns clearly described

Acknowledge and address any ethics & privacy related issues of your question(s), proposed dataset(s), and/or analyses. Use the information provided in lecture to guide your group discussion and thinking. If you need further guidance, check out [Deon's Ethics Checklist](http://deon.drivendata.org/#data-science-ethics-checklist). In particular:

- Are there any biases/privacy/terms of use issues with the data you proposed?
- Are there potential biases in your dataset(s), in terms of who it represents and how it was collected, that may be problematic in terms of it allowing for equitable analysis? (For example, does your data exclude particular populations, or is it likely to reflect particular human biases in a way that could be a problem?)
- How will you set out to detect these specific biases before, during, and after/when communicating your analysis?
- Are there any other issues related to your topic area, data, and/or analyses that are potentially problematic in terms of data privacy and equitable impact?
- How will you handle issues you identified?

- Data Privacy & Restriction

Data privacy concerns exist regarding property transaction records and demographic information. While most of this data is publicly available, we must ensure our analysis does not inadvertently reveal personally identifiable information about individual property owners or tenants. We will aggregate data at the zip code level to maintain privacy. For datasets that lack zip codes, we will assign records to a region or a newly created index rather than relying on any personal information. In addition, most public datasets carry their own usage restrictions. For example, Zillow’s market data is available for public viewing but cannot be scraped, stored, or republished without permission. Ensuring that every dataset we use is authorized for our purposes is therefore another priority.
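As a minimal sketch of the zip-level aggregation described above, the function below collapses transaction records into per-zip summaries so that owner-level fields never reach the analysis tables. The field names (`owner_name`, `zip_code`, `price`, `is_investor`), the function name `aggregate_by_zip`, and the sample records are assumptions made for illustration, not the schema of any dataset we have chosen.

```python
from collections import defaultdict
from statistics import median

def aggregate_by_zip(transactions):
    """Summarize transaction-level records per zip code.
    Only aggregate values are returned; identifying fields are dropped."""
    grouped = defaultdict(list)
    for record in transactions:
        grouped[record["zip_code"]].append(record)
    summary = {}
    for zip_code, records in grouped.items():
        summary[zip_code] = {
            "n_sales": len(records),
            "median_price": median(r["price"] for r in records),
            # Share of sales in this zip code flagged as investor purchases
            "investor_share": sum(r["is_investor"] for r in records) / len(records),
        }
    return summary

# Hypothetical records, for illustration only
transactions = [
    {"owner_name": "A", "zip_code": "92101", "price": 800_000, "is_investor": True},
    {"owner_name": "B", "zip_code": "92101", "price": 850_000, "is_investor": False},
    {"owner_name": "C", "zip_code": "92115", "price": 600_000, "is_investor": False},
]
summary = aggregate_by_zip(transactions)
print(summary)
```

Zip codes with very few sales could still point to individual transactions, so a minimum count per zip code may be worth considering before publishing any summary table.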

- Bias in Data Collection & Representation

There are potential biases in our dataset that need to be acknowledged and addressed. Institutional investor activity may be underreported in certain areas, and some demographic groups might be disproportionately affected by investor activity. We will cross-reference LLC-owned property counts with proprietary databases such as ATTOM and CoreLogic to identify underreported areas, and compare the distributions of race, income, and age in the dataset against 2020 U.S. Census data for San Diego County at the zip code level. For example, we would mark a group as underrepresented if Hispanic residents constitute 35% of the county population but only 20% of sampled property owners. We would then consider cultural differences (such as a tendency to live with family members) to test whether the bias still persists.
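The comparison against Census shares might look something like the sketch below. The 35% and 20% figures repeat the hypothetical example from the paragraph above; the complementary "Non-Hispanic" shares, the group labels, and the function name `representation_gaps` are assumptions added only so the example runs.

```python
def representation_gaps(sample_shares, census_shares):
    """Return sample share minus census share for each census group.
    Negative values mean the group is underrepresented in the sample."""
    return {
        group: sample_shares.get(group, 0.0) - census_share
        for group, census_share in census_shares.items()
    }

# Hypothetical shares, for illustration only
census_shares = {"Hispanic": 0.35, "Non-Hispanic": 0.65}
sample_shares = {"Hispanic": 0.20, "Non-Hispanic": 0.80}

gaps = representation_gaps(sample_shares, census_shares)
underrepresented = [group for group, gap in gaps.items() if gap < 0]
print(gaps, underrepresented)
```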

- Detection and Mitigation of Bias

Exploratory Data Analysis (EDA) will be conducted to examine demographic distributions and missing data patterns. If biases are detected, adjustments such as re-weighting or stratified sampling will be applied. For example, if a group's share differs from Census demographics by more than 5 percentage points, we would apply these methods to restore the balance.
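One possible form of the re-weighting step is simple post-stratification, sketched below under stated assumptions. The function name `poststratification_weights`, the group labels, and the shares are hypothetical, and the choice to re-weight every group once any single gap exceeds the threshold is our own simplification rather than a fixed part of the plan.

```python
def poststratification_weights(sample_shares, census_shares, threshold=0.05):
    """Return a weight per group so weighted sample shares match census shares.
    Weights stay at 1.0 unless some group's gap exceeds the threshold.
    Assumes every census group appears in the sample with a nonzero share."""
    needs_adjustment = any(
        abs(sample_shares[group] - census_shares[group]) > threshold
        for group in census_shares
    )
    if not needs_adjustment:
        return {group: 1.0 for group in census_shares}
    return {
        group: census_shares[group] / sample_shares[group]
        for group in census_shares
    }

# Hypothetical shares, for illustration only
census_shares = {"group_a": 0.35, "group_b": 0.65}
sample_shares = {"group_a": 0.20, "group_b": 0.80}

weights = poststratification_weights(sample_shares, census_shares)
weighted_shares = {g: sample_shares[g] * weights[g] for g in weights}
print(weights, weighted_shares)
```

After weighting, each group's weighted share equals its census share, which is the sense in which the balance is restored. Stratified sampling would be an alternative when re-weighting leaves too few observations in a group.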

- Post-Analysis Considerations

The findings of this research could have implications for housing policy and community development. We must be transparent about our methodology and careful not to draw causative conclusions where only correlative relationships exist. We will also consider the potential impact of our findings on various stakeholders, including local residents, policymakers, and market participants.

- Impact on the Market & Community

As mentioned above, we must keep our research transparent. We are aware, however, that revealing areas with high investor activity can unintentionally provide insights to all kinds of market participants, which could further exacerbate affordability problems. If investor activity significantly reduces affordability, this could lead to more displacement in lower-income neighborhoods. To minimize this risk, we hope to share findings with all stakeholders, including members of the community with a vested interest in housing equity.

# Team Expectations 


Read over the [COGS108 Team Policies](https://github.com/COGS108/Projects/blob/master/COGS108_TeamPolicies.md) individually. Then, include your group’s expectations of one another for successful completion of your COGS108 project below. Discuss and agree on what all of your expectations are. Discuss how your team will communicate throughout the quarter and consider how you will communicate respectfully should conflicts arise. By including each member’s name above and by adding their name to the submission, you are indicating that you have read the COGS108 Team Policies, accept your team’s expectations below, and have every intention to fulfill them. These expectations are for your team’s use and benefit — they won’t be graded for their details.

* *Team Expectation 1*
* *Team Expectation 2*
* *Team Expectation 3*
* ...

# Project Timeline Proposal

Specify your team's specific project timeline. An example timeline has been provided. Change the dates, times, names, and details to fit your group's plan.

If you think you will need any special resources or training outside what we have covered in COGS 108 to solve your problem, then your proposal should state these clearly. For example, if you have selected a problem that involves implementing multiple neural networks, please state this so we can make sure you know what you’re doing and so we can point you to resources you will need to implement your project. Note that you are not required to use outside methods.



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/20  |  1 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research |
| 1/26  |  10 AM |  Do background research on topic | Discuss ideal datasets and ethics; draft project proposal |
| 2/1 | 11:30 AM | Zoom meeting, ice breaker. Previous Project Review. | Briefly discussed our project’s topic. Collaborated on reviewing the previous projects |
| 2/5 | 3 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions | Further discussed our project idea. Making adjustments, making group policies. |
| 2/8 | 1 PM | Shijun Li came up with some great project topic ideas and hypotheses, however, we need further discussion | Came up with new project ideas. |
| 2/9 | 3:50 PM | For our project proposal, Richard decided to do the Background, Hypothesis, and ethical problems. Peng decided to edit and work further on ethics & privacy problems and the timeline proposal. | More details about who is working on each section of the project proposal. |
| 2/9 | 7 PM | Phone call and group chat discussion; talked more about the Project Proposal on GitHub and the future workload prediction/distribution | Discuss/edit Analysis; Complete the final proposal |
| 2/14  | 6 PM  | Import & Wrangle Data; EDA | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 12 PM  | Finalize wrangling/EDA; Begin Analysis | Discuss/edit Analysis; Complete project check-in |
| 3/9 | 11:59 PM | Checkpoint #2: EDA* | Complete project check-in |
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion| Discuss/edit full project |
| 3/19 | 11:59 PM | Final report & video*, team eval survey, post-course survey | |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |